I am sharing the slides I used for teaching my "Data Science by R" class. You can sign up for a class at http://www.nycdatascience.com/ (NYC Data Science Academy). We offer classes in R, Python, Processing, D3.js, Hadoop, and more.
2. Data Visualization
Data visualization
We will study the use of basic and advanced plotting functions in R, focusing on methods of exploring data through visualization.
· The related functions in R
· The properties of a single variable
· Displaying compositions
· The relationship between variables
· Exhibiting change over time
· Geographic information
Case study and exercise: Analyzing the NBA data with graphics
4. Data Visualization
Data visualization
A figure is worth a thousand words.
data <- read.table('data/anscombe.txt', header=TRUE)
data <- data[,-1]
head(data)
  x1 x2 x3 x4   y1   y2    y3   y4
1 10 10 10  8 8.04 9.14  7.46 6.58
2  8  8  8  8 6.95 8.14  6.77 5.76
3 13 13 13  8 7.58 8.74 12.74 7.71
4  9  9  9  8 8.81 8.77  7.11 8.84
5 11 11 11  8 8.33 9.26  7.81 8.47
6 14 14 14  8 9.96 8.10  8.84 7.04
5. Data Visualization
Data visualization
Let's calculate some statistical summaries: first the means of these datasets, then the correlation coefficients of the four (x, y) pairs.
colMeans(data)
x1 x2 x3 x4 y1 y2 y3 y4
9.0 9.0 9.0 9.0 7.5 7.5 7.5 7.5
sapply(1:4,function(x) cor(data[,x],data[,x+4]))
[1] 0.816 0.816 0.816 0.817
7. Data Visualization
Some basic principles
1. Determine the goal of the visualization from the beginning
· Exploratory visualization
· Explanatory visualization
2. Understand the characteristics of the data and the audience
· Which variables are important and interesting
· Consider the role and background of the audience
· Select a proper mapping
3. Keep it concise, but give enough information
35. Data Visualization
ggplot package
Polishing your plots for publication
p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
  geom_point(aes(colour=class, size=displ),
             alpha=0.5, position="jitter") +
  geom_smooth() +
  scale_size_continuous(range=c(4, 10)) +
  facet_wrap(~ year, ncol=1) +
  labs(title='Vehicle model and fuel consumption', # opts() was removed from ggplot2; labs(title=) replaces it
       y='Highway miles per gallon',
       x='Urban miles per gallon',
       size='Displacement',
       colour='Model')
40. Data Visualization
Histogram
We can customize the histogram as follows:
p <- ggplot(iris, aes(x=Sepal.Length)) +
  geom_histogram(binwidth=0.1,   # Set the bin width
                 fill='skyblue', # Set the fill color
                 colour='black') # Set the border color
42. Data Visualization
Histograms plus density curve
The main role of the histogram is to show counts by group and the shape of the distribution. The distribution of a sample is of central importance in traditional statistics. Another method that can show the distribution of the data is the kernel density estimate: from the data we can estimate a density curve that represents the distribution, and we can display the histogram and the density curve at the same time.
p <- ggplot(iris,aes(x=Sepal.Length)) +
geom_histogram(aes(y=..density..),
fill='skyblue',
color='black') +
geom_density(color='black',
linetype=2,adjust=2)
44. Data Visualization
Density curve
Like the bandwidth parameter, the adjust parameter controls the appearance of the density curve. We can try different values and draw multiple density curves: the smaller the value, the more volatile and sensitive the curve.
p <- ggplot(iris,aes(x=Sepal.Length)) +
geom_histogram(aes(y=..density..), # Note: map y to the estimated density rather than counts
fill='gray60',
color='gray') +
geom_density(color='black',linetype=1,adjust=0.5) +
geom_density(color='black',linetype=2,adjust=1) +
geom_density(color='black',linetype=3,adjust=2)
46. Data Visualization
Density curve
Density curves are also convenient for comparing groups. For example, to compare the Sepal.Length distributions of the three iris species:
p <- ggplot(iris,aes(x=Sepal.Length,fill=Species)) + geom_density(alpha=0.5,color='gray')
print(p)
47. Data Visualization
Boxplot
In addition to histograms and density curves, we can use boxplots to show the distribution of one-dimensional data. Boxplots also make it easy to compare groups.
p <- ggplot(iris,aes(x=Species,y=Sepal.Length,fill=Species)) + geom_boxplot()
print(p)
48. Data Visualization
Violin plot
A violin plot contains more information than a boxplot about the (sub-)distributions of the data:
p <- ggplot(iris,aes(x=Species,y=Sepal.Length,fill=Species)) + geom_violin()
print(p)
52. Data Visualization
Stacked bar chart
The number of vehicles of each class in the mpg dataset, with each bar subdivided by year:
mpg$year <- factor(mpg$year)
p <- ggplot(mpg,aes(x=class,fill=year)) +
geom_bar(color='black')
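Note that geom_bar() stacks raw counts by default. If true proportions are wanted, position='fill' rescales every bar to a total of 1; a minimal sketch, not from the original slides:
p <- ggplot(mpg, aes(x=class, fill=year)) +
  geom_bar(color='black', position='fill') + # rescale each bar so its segments sum to 1
  labs(y='proportion')
print(p)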
58. Data Visualization
Rose diagram
The wind rose, a graphic commonly used by meteorologists, describes the distribution of wind speed and direction at a specific place.
set.seed(1)
# Randomly generate 100 wind directions and divide them into 16 intervals.
dir <- cut_interval(runif(100, 0, 360), n=16)
# Randomly generate 100 wind speeds and divide them into 4 intensities.
mag <- cut_interval(rgamma(100, 15), 4)
sample <- data.frame(dir=dir, mag=mag)
# Map wind direction to the x-axis, frequency to the y-axis and speed to fill color,
# then transform the coordinates to polar.
p <- ggplot(sample, aes(x=dir, fill=mag)) +
  geom_bar() + coord_polar()
62. Data Visualization
The proportion structure of continuous data
data <- read.csv('data/soft_impact.csv', header=TRUE)
library(reshape2)
data.melt <- melt(data, id='Year')
p <- ggplot(data.melt, aes(x=Year, y=value,
                           group=variable, fill=variable)) +
  geom_area(color='black', size=0.3,
            position=position_fill()) +
  scale_fill_brewer()
67. Data Visualization
Scatter plot of multidimensional data
Represent different years with different shapes
mpg$year <- factor(mpg$year)
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) + geom_point(aes(color=year,shape=year))
print(p)
68. Data Visualization
Scatter plot of multidimensional data
With large data sets, the points in a scatter plot may obscure each other due to overplotting; adding a small random jitter alleviates this.
p <- ggplot(data=mpg, aes(x=cty, y=hwy)) + geom_point(aes(color=year), alpha=0.5, position="jitter")
print(p)
69. Data Visualization
Scatter plot of multidimensional data
To show the trend in the scatter plot, we can add a regression line.
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(color=year),alpha=0.5,position = "jitter") +
geom_smooth(method='lm')
print(p)
70. Data Visualization
Scatter plot of multidimensional data
In addition to color, we can also use the size of the points to reflect another variable, such as engine displacement. Plots like this are sometimes called "bubble charts".
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(color=year,size=displ),alpha=0.5,position = "jitter") +
geom_smooth(method='lm') +
scale_size_continuous(range = c(4, 10))
72. Data Visualization
Scatter plot of multidimensional data
Although we can show all the variables in one picture, we can also split it into multiple panels to show the characteristics of different subsets. This method is called grouping, conditioning, or faceting.
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(colour=class,size=displ),
alpha=0.5,position = "jitter") +
geom_smooth() +
scale_size_continuous(range = c(4, 10)) +
facet_wrap(~ year,ncol=1)
74. Data Visualization
ggplot exercise II
· Make a scatter plot of the diamonds data
· Use transparency and small points; look into the size and alpha options of geom_point()
· Use a 2-D bin chart to observe the intensity of points; look into stat_bin2d()
· Estimate the data density; look into stat_density2d() and use + coord_cartesian(xlim=c(0,1.5), ylim=c(0,6000))
(One possible solution is sketched below.)
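A sketch of one possible solution, assuming the classic carat-vs-price pairing (the xlim/ylim in the last bullet suggest it; the slides do not show their own answer):
library(ggplot2)
# Small, transparent points to fight overplotting
p1 <- ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point(size=0.5, alpha=0.1)
# 2-D bin counts show where points concentrate
p2 <- ggplot(diamonds, aes(x=carat, y=price)) +
  stat_bin2d(bins=60)
# 2-D density contours, zoomed in without dropping data
p3 <- ggplot(diamonds, aes(x=carat, y=price)) +
  stat_density2d() +
  coord_cartesian(xlim=c(0, 1.5), ylim=c(0, 6000))
print(p3)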
78. Data Visualization
Scatter plot of multidimensional data
A typical scatter plot shows the relationship between two variables. When you want to look at many bivariate relationships at once, you can use a scatter plot matrix, as sketched below.
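A minimal sketch of a scatter plot matrix using base R's pairs() (the slide's own figure is not reproduced here):
# All pairwise scatter plots of the four iris measurements, colored by species
pairs(iris[, 1:4], col=as.numeric(iris$Species), pch=19)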
81. Data Visualization
Change over time
For visualization of time-series data, the first step is to look at how the variable changes over time. As an example, we'll look at US unemployment figures from the economics dataset.
fillcolor <- ifelse(economics[440:470,'unemploy']<8000,'steelblue','red4')
p <- ggplot(economics[440:470,],aes(x=date,y=unemploy)) +
geom_bar(stat='identity',
fill=fillcolor)
83. Data Visualization
Change over time
For a short time series, a bar graph works well, and positive and negative values can be displayed in different colors. For a long time series the bars become crowded, and lines and points can be used in place of the bars.
p <- ggplot(economics[300:470,],aes(x=date,ymax=psavert,ymin=0)) +
geom_linerange(color='grey20',size=0.5) +
geom_point(aes(y=psavert),color='red4') +
theme_bw()
85. Data Visualization
Change over time
When the data are denser, we can use a line graph or an area chart to show the trend. Important time points or intervals can also be marked on the time-series graph, such as highlighting the 1980s.
fill.color <- ifelse(economics$date > '1980-01-01' &
                     economics$date < '1990-01-01',
                     'steelblue', 'red4')
p <- ggplot(economics, aes(x=date, ymax=psavert, ymin=0)) +
  geom_linerange(color=fill.color, size=0.9) +
  geom_text(aes(x=as.Date("1985-01-01", '%Y-%m-%d'), y=13), label="1980s") +
  theme_bw()
89. Data Visualization
Map
Two ways of drawing a map:
· Download geographic boundary data, then draw the geographical boundaries and identify areas and locations as needed (a minimal sketch follows)
· Download bitmap data from Google Maps, then mark location and path information on top of the map
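A minimal sketch of the first approach, drawing national boundaries from the maps package's polygon data (an illustration, not the slides' own example):
library(ggplot2)
library(maps) # supplies the boundary polygons behind map_data()
china <- map_data('world', region='China')
p <- ggplot(china, aes(x=long, y=lat, group=group)) +
  geom_polygon(fill='grey90', colour='black') +
  coord_quickmap() # keep an approximately correct aspect ratio
print(p)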
94. Data Visualization
Drawing a map of China based on a bitmap
Another way to draw a map of China is to download a bitmap from Google Maps or OpenStreetMap and then overlay point and line elements on it with ggplot2. The bitmap itself carries no latitude/longitude information; it is just a simple image, which makes for fast mapping.
library(ggmap)
library(XML)
webpage <- 'http://data.earthquake.cn/datashare/globeEarthquake_csn.html'
tables <- readHTMLTable(webpage, stringsAsFactors=FALSE)
raw <- tables[[6]]
data <- raw[, c(1, 3, 4)]
names(data) <- c('date', 'lat', 'lon')
data$lat <- as.numeric(data$lat)
data$lon <- as.numeric(data$lon)
data$date <- as.Date(data$date, "%Y-%m-%d")
# Read the map tiles from Google via the ggmap package, and mark the earthquake data on the map.
earthquake <- ggmap(get_googlemap(center='china', zoom=4, maptype='terrain'), extent='device') +
  geom_point(data=data, aes(x=lon, y=lat), colour='red', alpha=0.7) +
  theme(legend.position="none")
96. Data Visualization
R and interactive visualization
googleVis is an R package providing an interface between R and the Google Visualization API. It allows the user to call the Google Visualization API for data visualization without needing to upload the data. We want to compare the development trajectories of a group of 20 countries over the past several years. To obtain the data, we selected three variables from the World Bank database, reflecting the change in GDP, CO2 emissions and life expectancy between 2001 and 2009.
library(googleVis)
library(WDI)
# The country list was truncated on the original slide; the indicator codes and
# year range below are filled in from the slide text (GDP, CO2 emissions and
# life expectancy, 2001-2009).
DF <- WDI(country=c("CN","RU","BR","ZA","IN","DE","AU","CA","FR","IT","JP","MX","GB","US"),
          indicator=c('NY.GDP.MKTP.CD','EN.ATM.CO2E.KT','SP.DYN.LE00.IN'),
          start=2001, end=2009)
M <- gvisMotionChart(DF, idvar="country", timevar="year",
                     xvar='EN.ATM.CO2E.KT',
                     yvar='NY.GDP.MKTP.CD')
plot(M)
98. Data Visualization
Exercise III: Analyzing NBA data
· Calculate the seasonal winning rate, and draw a bar chart
· Calculate the seasonal winning rate at home and on the road, and draw a bar chart
· Draw a set of four histograms of the home side's seasonal scores
· Draw boxplots of the home side's scores across the five seasons
· Draw boxplots of the scores of all games for the home side and the opposing side
· Calculate the average score and winning percentage against each opponent, and make a scatter plot to identify the strong and the weak teams
(A sketch for the first item follows.)
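A minimal sketch for the first item, assuming a data frame nba with hypothetical columns season and home_win (1 if the home side won); the actual column names of the NBA file are not shown in the slides:
library(ggplot2)
# Winning rate per season: mean of the 0/1 win indicator (column names are assumed)
win.rate <- aggregate(home_win ~ season, data=nba, FUN=mean)
p <- ggplot(win.rate, aes(x=season, y=home_win)) +
  geom_bar(stat='identity') +
  labs(y='winning rate')
print(p)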