Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
1. Prof. Pier Luca Lanzi
Data Exploration
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
2. Prof. Pier Luca Lanzi
Readings
• “Data Mining and Analysis” – Chapter 2 & 3
• Loading Data and Basic Formatting in R
http://flowingdata.com/2015/02/18/loading-data-and-basic-formatting-in-r/
3. Prof. Pier Luca Lanzi
before trying anything,
you should know your data!
4. Prof. Pier Luca Lanzi
What is Data Exploration?
• It is the preliminary exploration of the data aimed at identifying
their most relevant characteristics
• What are the key motivations?
  - Helping to select the right tool for preprocessing and data
    mining
  - Exploiting humans’ abilities to recognize patterns
    not captured by automatic tools
• Related to Exploratory Data Analysis (EDA), created by
statistician John Tukey
5. Prof. Pier Luca Lanzi
Exploratory Data Analysis
• “An approach of analyzing data to summarize their main
characteristics without using a statistical model or having
formulated a prior hypothesis.”
• “Exploratory data analysis was promoted by John Tukey
to encourage statisticians to explore the data, and
possibly formulate hypotheses that could lead to new
data collection and experiments.”
• Seminal book is “Exploratory Data Analysis” by Tukey
• Nice online introduction in Chapter 1 of the NIST
Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
• http://en.wikipedia.org/wiki/Exploratory_data_analysis
6. Prof. Pier Luca Lanzi
http://vis.stanford.edu/files/2012-EnterpriseAnalysisInterviews-VAST.pdf
7. Prof. Pier Luca Lanzi
What Data Exploration Techniques?
• Exploratory Data Analysis (as originally defined by Tukey)
was mainly focused on
  - Visualization
  - Clustering and anomaly detection
    (viewed as exploratory techniques)
  - In data mining, clustering and anomaly detection are major
    areas of interest, and not thought of as just exploratory
• In this section, we focus on data exploration using
  - Summary statistics
  - Visualization
• We will come back later to data exploration when discussing
clustering
18. Prof. Pier Luca Lanzi
Geometric Interpretation of Correlation
• The correlation coefficient is simply the cosine of the angle
between the two centered attribute vectors
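A minimal sketch of this statement in R, using two small made-up vectors (the values are illustrative assumptions, not data from the course): centering each attribute and taking the cosine of the angle between the results should give the same number as `cor()`.

```r
# Hypothetical toy attributes (not from the slides)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Center each attribute by subtracting its mean
xc <- x - mean(x)
yc <- y - mean(y)

# Cosine of the angle between the two centered vectors
cos_angle <- sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))

# Pearson correlation as computed by R
r <- cor(x, y)

print(c(cosine = cos_angle, correlation = r))
```

For these values both quantities equal 0.8; vectors pointing in the same direction after centering give 1, orthogonal ones give 0.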
22. Prof. Pier Luca Lanzi
Summary Statistics
• They are numbers that summarize properties of the data
• Summarized properties include location (e.g., the mean), spread (e.g., the
standard deviation), and skewness
• Most summary statistics can be calculated in a single pass through the data
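As an illustration of the single-pass idea, the sketch below computes the mean and variance while reading each value exactly once. It uses Welford's update, which is one common choice and not necessarily the method intended in the lecture; the function name `running_stats` and the sample values are assumptions made for this example.

```r
# Single-pass mean and variance using Welford's update.
# running_stats is a hypothetical helper, not part of any package.
running_stats <- function(values) {
  n <- 0
  mean_so_far <- 0
  m2 <- 0  # running sum of squared deviations from the current mean
  for (v in values) {
    n <- n + 1
    delta <- v - mean_so_far
    mean_so_far <- mean_so_far + delta / n
    m2 <- m2 + delta * (v - mean_so_far)
  }
  list(n = n,
       mean = mean_so_far,
       variance = if (n > 1) m2 / (n - 1) else NA_real_)
}

# Illustrative humidity-like readings
humidity <- c(85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91)
stats <- running_stats(humidity)

# Compare with R's built-in functions
print(c(stats$mean, mean(humidity)))
print(c(stats$variance, var(humidity)))
```

Statistics such as the median do not fit this pattern, since they need the values in sorted order, which is one reason the slide says "most" rather than "all".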
23. Prof. Pier Luca Lanzi
Frequency and Mode
• The frequency of an attribute value is the percentage of times the
value occurs in the data set
• For example, given the attribute ‘gender’ and a representative
population of people, the gender ‘female’ occurs about 50% of
the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with
categorical data
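A short sketch of how frequency and mode could be computed in R for a categorical attribute. The `outlook` vector below is made up for illustration and is not the course's weather data.

```r
# Hypothetical categorical attribute
outlook <- factor(c("sunny", "sunny", "overcast", "rainy", "rainy",
                    "rainy", "overcast", "sunny", "sunny", "sunny"))

# Absolute counts of each value
counts <- table(outlook)

# Frequency as a percentage of all observations
freq <- 100 * prop.table(counts)

# Mode: the most frequent value
# (which.max returns the first maximum if there are ties)
mode_value <- names(counts)[which.max(counts)]

print(freq)
print(mode_value)
```

Here "sunny" occurs in 5 of the 10 observations, so its frequency is 50% and it is the mode.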
24. Prof. Pier Luca Lanzi
Measures of Location: Mean and Median
• The mean is the most common measure of the location of a set
of points
• However, the mean is very sensitive to outliers
• Thus, the median or a trimmed mean is also commonly used
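The effect of an outlier on these measures can be sketched as follows. The temperature values and the mistyped reading of 750 are assumptions chosen for illustration.

```r
# Hypothetical temperature readings
temps <- c(64, 65, 68, 69, 70, 71, 72, 75)

# The same readings plus one mistyped value acting as an outlier
temps_outlier <- c(temps, 750)

# The mean shifts substantially, the median barely moves
print(c(mean = mean(temps), median = median(temps)))
print(c(mean = mean(temps_outlier), median = median(temps_outlier)))

# Trimmed mean: drop 20% of the observations from each end before averaging
trimmed <- mean(temps_outlier, trim = 0.2)
print(trimmed)
```

With nine values, `trim = 0.2` removes one observation from each end, so the outlier is discarded and the trimmed mean stays close to the median.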
25. Prof. Pier Luca Lanzi
Measures of Spread: Range and Variance
• Range is the difference between the max and min
• The variance or standard deviation is the most common measure
of the spread of a set of points
• However, both are also sensitive to outliers, so other measures
are often used
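A small sketch comparing these measures on made-up data with one outlier. The interquartile range and the median absolute deviation are shown as two examples of the "other measures"; choosing these two is an assumption of this example, since the slide does not name them.

```r
# Hypothetical readings with one outlier (750)
x <- c(64, 65, 68, 69, 70, 71, 72, 75, 750)

# Range: difference between the max and the min
x_range <- diff(range(x))

# Variance and standard deviation: strongly inflated by the outlier
x_var <- var(x)
x_sd <- sd(x)

# More robust alternatives
x_iqr <- IQR(x)  # interquartile range, Q3 - Q1
x_mad <- mad(x)  # median absolute deviation (scaled by 1.4826 by default)

print(c(range = x_range, sd = x_sd, iqr = x_iqr, mad = x_mad))
```

The range and the standard deviation are dominated by the single outlier, while the interquartile range and the median absolute deviation still reflect the spread of the bulk of the data.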
26. Prof. Pier Luca Lanzi
Percentiles
• For continuous data, the notion of a percentile is very useful
• Given an ordinal or continuous attribute x and a number p
between 0 and 100, the p-th percentile is a value x_p of x such
that p% of the observed values of x are less than x_p
• For instance, the 50th percentile is the value x_50% such that 50%
of all values of x are less than x_50%
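In R, percentiles can be obtained with `quantile()`. The sketch below uses made-up values; note that `quantile()` interpolates between observations by default (its `type = 7`), and other definitions of the percentile can give slightly different numbers on small samples.

```r
# Hypothetical continuous attribute
x <- c(64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85)

# 25th, 50th and 75th percentiles
p <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(p)

# The 50th percentile (median) as a plain number
x50 <- quantile(x, 0.5, names = FALSE)

# Fraction of observed values below the 50th percentile
below <- mean(x < x50)
print(c(x50 = x50, fraction_below = below))
```

For these twelve values the 50th percentile is 71.5, and exactly half of the observations fall below it.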
27. Prof. Pier Luca Lanzi
Summary Statistics in R
# the stringr package must also be installed
# required libraries
library(foreign)
library(rattle)
library(plyr)
# set working directory
setwd("~/Dropbox/Documents/Online/Corsi/DMTM/data")
# load data
wnum = read.arff("weather.numeric.arff")
# check the structure
str(wnum)
# names of the attributes
names(wnum)
# rename the attributes to normalize them in lowercase
names(wnum) <- normVarNames(names(wnum))
28. Prof. Pier Luca Lanzi
Summary Statistics in R
# summary statistics
summary(wnum)
mean(wnum$humidity)
sd(wnum$humidity)
var(wnum$humidity)
median(wnum$humidity)
max(wnum$humidity)
min(wnum$humidity)
# first ten rows
head(wnum,10)
# first 5 rows columns 1 & 3
head(wnum[c(1,3)],5)
# naming the columns
head(wnum[c("humidity", "temperature")])
29. Prof. Pier Luca Lanzi
Summary Statistics in R
# sampling ten rows
head(wnum[sample(nrow(wnum), 10),c("humidity", "temperature")])
# selecting the rows based on attribute values
wnum[wnum$outlook=="sunny",]
# exploring data with sorted attributes
wnum[order(wnum$temperature),]
###
### grouping and summarizing data
###
ddply(wnum,"outlook",summarize, max=max(temperature, na.rm=TRUE))
30. Prof. Pier Luca Lanzi
Visualization
• Visualization is the conversion of data into a visual or tabular
format so that the characteristics of the data and the relationships
among data items or attributes can be analyzed or reported
• Data visualization is one of the most powerful and appealing
techniques for data exploration
  - Humans have a well developed ability to analyze large
    amounts of information that is presented visually
  - Can detect general patterns and trends
  - Can detect outliers and unusual patterns
31. Prof. Pier Luca Lanzi
Barplots
• They use horizontal or vertical bars to compare categories.
• One axis shows the compared categories, the other axis
represents a discrete value
• Some bar graphs present bars clustered in groups of more than
one (grouped bar graphs), and others show the bars divided into
subparts to show cumulative effect (stacked bar graphs)
32. Prof. Pier Luca Lanzi
Barplot using R
# barplot of the outlook counts (barplot() needs numeric heights, so tabulate first)
barplot(table(wnum$outlook))
###
### distribution of outlook values
### split according to attribute windy
###
counts <- table(wnum$windy,wnum$outlook)
barplot(counts,col=c("darkblue","red"))
# horizontal
barplot(counts,col=c("darkblue","red"), horiz=TRUE)
# grouped
barplot(counts,col=c("darkblue","red"), horiz=TRUE, beside=TRUE)
33. Prof. Pier Luca Lanzi
Histograms
• They are a graphical representation of the distribution of data
introduced by Karl Pearson in 1895
• They estimate the probability distribution of a continuous variable
• They are representations of tabulated frequencies depicted as adjacent
rectangles, erected over discrete intervals (bins). Their areas are proportional
to the frequency of the observations in the interval.
34. Prof. Pier Luca Lanzi
Histograms
• The height of each bar indicates the number of objects falling in that bin
• The shape of the histogram depends on the number of bins
Example: Petal Width (10 and 20 bins, respectively)
35. Prof. Pier Luca Lanzi
Histograms in R
# distribution of temperature values
hist(wnum$temperature, xlab="Temperature",
main="Temperature Distribution", col="blue")
# fewer bins
hist(wnum$temperature, breaks=2, xlab="Temperature",
main="Temperature Distribution", col="blue")
# fit the distribution
d <- density(wnum$temperature)
plot(d, main="Temperature Distribution", xlab="Temperature")
polygon(d, col="red", border="blue")
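# effect of the number of bins on the distribution
# (assumes iris was loaded with a lowercase "petalwidth" column;
# with R's built-in iris the column is iris$Petal.Width)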
hist(iris$petalwidth, col="blue", breaks=10, xlab="Petal Width", main="")
hist(iris$petalwidth, col="blue", breaks=20, xlab="Petal Width", main="")
d <- density(iris$petalwidth)
plot(d, main="Petal Width Distribution", xlab="Petal Width")
polygon(d, col="red", border="blue")
36. Prof. Pier Luca Lanzi
Maps
• Maps are a very effective communication tool that can visualize
thousands of data points in just one figure
37. Prof. Pier Luca Lanzi
Mapping Data with R
# http://togaware.com/onepager/MapsO.pdf
# required packages
install.packages("maps")
install.packages("ggmap")
install.packages("oz")
library(maps)
library(ggmap)
library(oz)
library(ggplot2)
library(scales)
ds <- map_data("world")
p <- ggplot(ds, aes(x=long, y=lat, group=group, fill=region))
p <- p + geom_polygon()
p <- p + theme(legend.position = "none")
38. Prof. Pier Luca Lanzi
Mapping Data with R
# World map
p <- ggplot(ds, aes(x=long, y=lat, group=group, fill=region))
p <- p + geom_polygon()
p <- p + theme(legend.position = "none")
p
# Map of Europe
map <- get_map(location="Europe", zoom=4)
p <- ggmap(map)
p
# Google Map
map <- get_map(location="Sydney", zoom=14, maptype="roadmap",
source="google")
p <- ggmap(map)
p
39. Prof. Pier Luca Lanzi
Mapping Data with R
# Mapping arrests
arrests <- USArrests
names(arrests) <- tolower(names(arrests))
arrests$region <- tolower(rownames(USArrests))
head(arrests)
# get the long/lat coordinates of the state borders
states <- map_data("state")
# join the state border data with the arrests data using region
ds <- merge(states, arrests, sort=FALSE, by="region")
# reorder the points for plotting, otherwise the borders are plotted incorrectly
ds <- ds[order(ds$order), ]
# refer to columns by name inside aes() rather than through ds$
g <- ggplot(ds, aes(long, lat, group=group, fill=assault/murder))
g <- g + geom_polygon()
print(g)
Mapping Data for Visualization
• Data objects, their attributes, and the relationships among data
objects are translated into graphical elements such as points, lines,
shapes, and colors
• Objects are often represented as points
• Their attribute values can be represented as the position of the
points or the characteristics of the points, e.g., color, size, and
shape
• If position is used, then the relationships of points, i.e., whether
they form groups or a point is an outlier, are easily perceived.
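The mapping from attributes to graphical elements can be sketched in a few lines of base R. This is an illustrative example only, assuming the built-in iris data frame. The choice of which attribute drives position, color, and size is arbitrary, and the names species_cols, point_col, and point_size are made up for the sketch.

```r
# Minimal sketch: each object is a point; attributes drive its appearance.
# Assumes the built-in iris data set.
iris_df <- datasets::iris

# color encodes the class attribute
species_cols <- c("red", "darkgreen", "blue")
point_col <- species_cols[as.integer(iris_df$Species)]

# marker size encodes a third numeric attribute
point_size <- iris_df$Sepal.Length / 4

# position encodes two numeric attributes
plot(iris_df$Petal.Length, iris_df$Petal.Width,
     col = point_col, cex = point_size, pch = 19,
     xlab = "Petal Length", ylab = "Petal Width",
     main = "Attributes mapped to position, color, and size")
legend("topleft", legend = levels(iris_df$Species),
       col = species_cols, pch = 19)
```

Because position carries two of the attributes, groups of points and isolated points become visible at a glance.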
Data Arrangement
• Data arrangement is the placement of visual elements within a display
• It influences how easy it is to understand the data
Selection
• Is the elimination or the de-emphasis of certain objects and attributes
• Selection may involve choosing a subset of attributes
  - Dimensionality reduction is often used to reduce the number of dimensions to two or three
  - Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset of objects
  - A region of the screen can only show so many points
  - Can sample, but want to preserve points in sparse areas
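Both kinds of selection can be sketched as follows. This is only an illustration, assuming the built-in iris data frame, principal component analysis as the dimensionality reduction method, and a simple random sample of 50 objects. Plain random sampling does not by itself protect points in sparse areas, so a real application may need a smarter sampling scheme.

```r
# Minimal sketch: select attributes (via PCA) and objects (via sampling).
# Assumes the built-in iris data set; PCA is one possible reduction method.
iris_df <- datasets::iris

# attribute selection: project the four measurements onto two components
pca <- prcomp(iris_df[, 1:4], scale. = TRUE)
proj <- pca$x[, 1:2]

# object selection: keep a random subset of the objects
set.seed(1)
keep <- sample(nrow(proj), 50)

plot(proj[keep, ], col = as.integer(iris_df$Species[keep]), pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Two principal components, 50 sampled objects")
```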
Box Plots
• Box Plots (invented by Tukey)
• Another way of displaying the distribution of data
• Following figure shows the basic part of a box plot
• IQR is the interquartile range, or Q3-Q1
[Figure: anatomy of a box plot, with labels for outlier, Min (> Q1 - 1.5 IQR), 1st quartile (Q1), 2nd quartile (Q2), 3rd quartile (Q3), and Max (< Q3 + 1.5 IQR)]
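The quantities in the box plot can be computed by hand. The sketch below assumes the built-in iris data frame and R's default quantile definition. All variable names are illustrative.

```r
# Minimal sketch of the quantities shown in a box plot.
# Assumes the built-in iris data set; names below are illustrative.
x <- datasets::iris$Sepal.Width

q <- unname(quantile(x, c(0.25, 0.50, 0.75)))  # Q1, Q2 (median), Q3
iqr <- q[3] - q[1]                             # interquartile range

# fences at 1.5 IQR below Q1 and above Q3
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# whiskers end at the most extreme observations inside the fences
inside <- x[x >= lower_fence & x <= upper_fence]
whiskers <- range(inside)

# anything beyond the fences is drawn as an individual outlier
outliers <- x[x < lower_fence | x > upper_fence]

print(c(Q1 = q[1], Q2 = q[2], Q3 = q[3], IQR = iqr))
print(whiskers)
print(sort(outliers))
```

Note that boxplot() computes the box from Tukey's hinges rather than from quantile(), so its values can differ slightly from the ones printed here.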
Boxplots in R
# set working directory
setwd("~/Dropbox/Documents/Online/Corsi/DMTM/data")
# load data (read.arff is provided by the foreign package)
library(foreign)
iris <- read.arff("iris.arff")
boxplot(iris[,1:4], ylab="values (cm)", main="iris dataset")
boxplot(iris[,1:4], ylab="values (cm)", main="iris dataset", las=2,
names=c("sep len", "sep wid", "pet len", "pet wid"),
col=c("lightblue","lightblue","pink","pink"),
at=c(1,2,5,6))
boxplot(iris[,1:4], ylab="values (cm)", main="iris dataset", las=2,
names=c("sep len", "sep wid", "pet len", "pet wid"),
border=c("blue","blue","green","green"),
at=c(1,2,5,6))
# more examples
# http://www.r-bloggers.com/box-plot-with-r-tutorial/
Scatter Plots
• Attribute values determine the position
• Two-dimensional scatter plots are the most common, but
three-dimensional scatter plots are also possible
• Often additional attributes can be displayed by using the size,
shape, and color of the markers that represent the objects
• Arrays of scatter plots are useful because they can compactly
summarize the relationships of several pairs of attributes
• Examples: http://www.statmethods.net/graphs/scatterplot.html
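An array of scatter plots can be produced with the base R function pairs(). The following is a minimal sketch, assuming the built-in iris data frame. The color assignment and the variable names are illustrative.

```r
# Minimal sketch: scatter plot matrix of all attribute pairs.
# Assumes the built-in iris data set.
iris_df <- datasets::iris
measurements <- iris_df[, 1:4]

# color the markers by class so a fifth attribute is visible in every panel
species_cols <- c("red", "darkgreen", "blue")
point_col <- species_cols[as.integer(iris_df$Species)]

# number of distinct attribute pairs shown in the array
n_pairs <- choose(ncol(measurements), 2)
print(n_pairs)

pairs(measurements, col = point_col, pch = 19,
      main = "Scatter plot matrix of the iris measurements")
```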
Scatter Plots in R
# set working directory
setwd("~/Dropbox/Documents/Online/Corsi/DMTM/data")
# load data (read.arff is provided by the foreign package)
library(foreign)
iris <- read.arff("iris.arff")
# attach iris to the search path
attach(iris)
# plot scatter plot
plot(petallength,sepallength,
xlab="Petal Length",
ylab="Sepal Length", pch=19)
# linear model
abline(lm(sepallength~petallength),
col="red") # regression line (y~x)
# locally weighted regression (lowess) smoother
lines(lowess(petallength, sepallength),
col="blue") # lowess line (x,y)
Scatter Plots in R
library(car)
scatterplot(sepallength~petallength | class, data=iris, xlab="Petal Length",
ylab="Sepal Length", main="Enhanced Scatter Plot", labels=row.names(iris))
Scatter Plot Combined with Box Plots
Scatter Plots with Box Plots
# Add boxplots to a scatterplot
par(fig=c(0,0.8,0,0.8))
plot(mtcars$wt, mtcars$mpg, xlab="Car Weight",
ylab="Miles Per Gallon")
par(fig=c(0,0.8,0.55,1), new=TRUE)
boxplot(mtcars$wt, horizontal=TRUE, axes=FALSE)
par(fig=c(0.65,1,0,0.8),new=TRUE)
boxplot(mtcars$mpg, axes=FALSE)
mtext("Enhanced Scatterplot", side=3, outer=TRUE, line=-3)
Contour Plots
• Useful for continuous attributes measured on a spatial grid
• They partition the plane into regions of similar values
• The contour lines that form the boundaries of these regions
connect points with equal values
• The most common example is contour maps of elevation
• Can also display temperature, rainfall, air pressure, etc.
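A contour map of elevation can be sketched with the base R function contour(). The example assumes the volcano matrix that ships with R, which holds elevations on a regular grid. The 10 m grid spacing follows that dataset's documentation, and the axis labels are illustrative.

```r
# Minimal sketch: contour map of elevation on a spatial grid.
# Assumes the built-in volcano matrix (elevations on a regular grid).
z <- datasets::volcano

# grid coordinates, assuming 10 m spacing between grid points
x <- 10 * seq_len(nrow(z))
y <- 10 * seq_len(ncol(z))

# each contour line connects grid locations with equal elevation
contour(x, y, z, nlevels = 10,
        xlab = "Grid position along rows (m)",
        ylab = "Grid position along columns (m)",
        main = "Elevation contours")
```

The same call works for any attribute measured on a grid, such as temperature or rainfall, as long as the values are stored in a matrix.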
Matrix Plots
• Can plot the data matrix
• This can be useful when objects are sorted according to class
• Typically, the attributes are normalized to prevent one attribute
from dominating the plot
• Plots of similarity or distance matrices can also be useful for
visualizing the relationships between objects
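Both kinds of matrix plot can be sketched with the base R function image(). The example assumes the built-in iris data frame, standardization with scale() as the normalization, and Euclidean distance. These are illustrative choices, not the only options.

```r
# Minimal sketch: plot a normalized data matrix and a distance matrix.
# Assumes the built-in iris data set.
iris_df <- datasets::iris

# sort the objects by class and standardize each attribute
ord <- order(iris_df$Species)
m <- scale(as.matrix(iris_df[ord, 1:4]))

# data matrix: one column per attribute, one row per object
image(x = seq_len(ncol(m)), y = seq_len(nrow(m)), z = t(m),
      axes = FALSE, xlab = "Attribute", ylab = "Object (sorted by class)",
      main = "Normalized data matrix")
axis(1, at = seq_len(ncol(m)), labels = colnames(m))

# distance matrix: blocks along the diagonal correspond to the classes
d <- as.matrix(dist(m))
image(d, axes = FALSE, main = "Euclidean distance matrix")
```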
additional examples
http://onepager.togaware.com/PlotsO.pdf
Other Visualization Techniques
• Star Plots
  - Similar approach to parallel coordinates, but axes radiate from a central point
  - The line connecting the values of an object is a polygon
• Chernoff Faces
  - Approach created by Herman Chernoff
  - This approach associates each attribute with a characteristic of a face
  - The values of each attribute determine the appearance of the corresponding facial characteristic
  - Each object becomes a separate face
  - Relies on humans' ability to distinguish faces
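Star plots are available in base R through the stars() function. The following is a minimal sketch, assuming the built-in mtcars data frame and an arbitrary selection of nine cars and seven attributes.

```r
# Minimal sketch: one star (polygon) per object, one axis per attribute.
# Assumes the built-in mtcars data set; the subset is arbitrary.
cars <- datasets::mtcars[1:9, c("mpg", "cyl", "disp", "hp",
                                "drat", "wt", "qsec")]

# stars() rescales each attribute to [0, 1] by default,
# so no single attribute dominates the shape of the polygons
stars(cars, flip.labels = FALSE,
      main = "Star plots of nine cars")
```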
do we read the articles we share?
does my banner work?
“When we combined attention and traffic to find the story that had
the largest volume of total engaged time, we found that it had fewer
than 100 likes and fewer than 50 tweets. Conversely, the story with
the largest number of tweets got about 20% of the total engaged
time that the most engaging story received.”
“Here’s the skinny, 66% of attention on a normal media page is spent
below the fold. That leaderboard at the top of the page? People
scroll right past that and spend their time where the content not the
cruft is.”
http://time.com/12933/what-you-think-you-know-about-the-web-is-wrong/?iid=biz-category-mostpop2
http://apps.startribune.com/news/20140313-beer-me-minnesota/