Data may be organized in many different ways; the logical or mathematical model of a particular organization of data is called a "data structure". The choice of a particular data model depends on two considerations:
It must be rich enough in structure to reflect the actual relationships of the data in the real world.
The structure should be simple enough that one can effectively process the data when necessary.
Data Structure Operations
The particular data structure that one chooses for a given situation depends largely on the nature of specific operations to be performed.
The following are the four major operations associated with any data structure:
i. Traversing: Accessing each record exactly once so that certain items in the record may be processed.
ii. Searching: Finding the location of the record with a given key value, or finding the locations of all records which satisfy one or more conditions.
iii. Inserting: Adding a new record to the structure.
iv. Deleting: Removing a record from the structure.
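To make the four operations concrete, here is a minimal sketch in R (the language used later in these notes); the named vector `records` and the cutoff 80 are invented for illustration:

```r
# A toy "structure": a named numeric vector standing in for keyed records
records <- c(alice = 70, bob = 85, carol = 90)

# Traversing: visit each record exactly once
for (nm in names(records)) print(records[[nm]])

# Searching: find the locations of all records satisfying a condition
which(records > 80)    # positions 2 (bob) and 3 (carol)

# Inserting: add a new record
records <- c(records, dave = 60)

# Deleting: remove a record by its key
records <- records[names(records) != "bob"]
names(records)         # "alice" "carol" "dave"
```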
Primitive and Composite Data Types
Primitive data types are the basic data types of a language. In most computers they are native to the machine's hardware.
Some Primitive data types are:
Integer
Data processing is an integral part of most modern software development. Understanding of Abstract Algebra and Category theory will be beneficial for addressing data processing concerns.
This is a presentation on Arrays, one of the most important topics on Data Structures and algorithms. Anyone who is new to DSA or wants to have a theoretical understanding of the same can refer to it :D
R is a programming language and environment commonly used in statistical computing, data analytics and scientific research.
It is one of the most popular languages used by statisticians, data analysts, researchers and marketers to retrieve, clean, analyze, visualize and present data.
Due to its expressive syntax and easy-to-use interface, it has grown in popularity in recent years.
A high level introduction to R statistical programming language that was presented at the Chicago Data Visualization Group's Graphing in R and ggplot2 workshop on October 8, 2012.
Abstract: This PDSG workshop introduces the basics of Python libraries used in machine learning. Libraries covered are NumPy, Pandas and Matplotlib.
Level: Fundamental
Requirements: One should have some knowledge of programming and some statistics.
Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and in interactive environments across platforms.
This 10-hour class is intended to give students the basis to empirically solve statistical problems. Talk 1 serves as an introduction to the statistical software R, and presents how to calculate basic measures such as the mean, variance, correlation and Gini index. Talk 2 shows how the central limit theorem and the law of large numbers work empirically. Talk 3 presents the point estimate, the confidence interval and the hypothesis test for the most important parameters. Talk 4 introduces the linear regression model and Talk 5 the bootstrap world. Talk 5 also presents a simple example of a Markov chain.
All the talks are supported by scripts written in the R language.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: Short Report (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Introduction to Data Analysis and Graphics
Hellen Gakuruh
2017-03-07
Session Two
Vector and Assignment, Data Objects and Data Importation
Outline
By the end of this session we will have knowledge on:
• Vectors and Assignment
• Data types
• Data structures and
• Importing data into R
Vector and Assignment
• The simplest data structure in R is a vector. From a data point of view, a vector is a collection of elements. These elements can be numeric values, alphabetical characters, logical values, or dates and times.
• Vectors are created with the function "c", which means "concatenate", e.g. a numerical vector c(1, 5, 6, 8)
• These vectors can be named using the assignment operator "<-" or the function "assign()", e.g. to assign the vector c(1, 5, 6, 8) to the name "num": num <- c(1, 5, 6, 8) or assign("num", c(1, 5, 6, 8)). We usually use "<-" for assignment; the "assign" function is mostly used when developing functions
• A vector can be of any length, beginning from 1 up to about 2.1474836 × 10^9
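A runnable sketch of the two assignment styles described above (the names `num` and `num2` are invented for illustration):

```r
# Create a numeric vector and bind it to a name with "<-"
num <- c(1, 5, 6, 8)

# Equivalent binding with assign()
assign("num2", c(1, 5, 6, 8))

length(num)          # 4 elements
identical(num, num2) # TRUE: both styles produce the same vector
```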
Data types
R recognises seven data types, these are:
• Logical
• Integer
• Real/Double
• String/Character
• Factor
• Complex
• Raw
• The R manuals specify six types: logical, integer, double, character, complex and raw. Factor, however, is a data type that does not fall into any of the six listed types.
• In this sub-section we introduce these data types
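A quick way to see these types in a session is to ask R itself; `typeof()` reports the internal storage type (a minimal sketch, values invented):

```r
typeof(TRUE)        # "logical"
typeof(1L)          # "integer"
typeof(1.5)         # "double"
typeof("a")         # "character"
typeof(1 + 2i)      # "complex"
typeof(as.raw(1))   # "raw"
class(factor("a"))  # "factor" (typeof() reports "integer" underneath)
```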
Data types: Logical
• These are vectors with only TRUE and FALSE values, like c(TRUE, TRUE, FALSE, TRUE, FALSE)
• Can be considered binary vectors in analysis
• Other than categorical variables with these values, such vectors are often created by comparison and logical operators like "<", ">", "<=", ">=", "==", "!=", "|", "||", "&", and "&&"
• During analysis, these vectors can be coerced to numeric values, in which case TRUE becomes 1 and FALSE becomes 0
• These vectors may include the value "NA", which in R means "Not Available", a placeholder for missing values
• Any operation done on a vector containing NA is bound to result in NA, since NA is unknown
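The coercion and NA behaviour described above can be checked directly (a small sketch; the vectors are invented for illustration):

```r
lgl <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

sum(lgl)    # TRUE coerces to 1, FALSE to 0, so the sum is 3
mean(lgl)   # proportion of TRUE values: 0.6

sum(c(1, 2, NA))                # NA propagates, so the result is NA
sum(c(1, 2, NA), na.rm = TRUE)  # 3 after dropping the missing value
```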
Data types: Integer
• These are basically positive and negative whole numbers, without fractions: {. . . , -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, . . . }
• In R, integers are denoted with the letter L, e.g. c(-3L, 0L, 2L, 5L, 6L). You can confirm it is an integer vector with the function is.integer(c(-3L, 0L, 2L, 5L, 6L))
• An example of a variable which can naturally be considered integer-valued is "number of people" (you can't have a fraction of a person)
• Mathematically denoted by \( \mathbb{Z} \)
Data types: Real/Double
• A real number is any number along the (infinitely long) number line
• They include fractions
• Denoted mathematically by \( \mathbb{R} \)
• Any numeric vector whose values are not followed by the letter "L" is considered double, e.g. c(-3, 0, 2, 5, 6). You can confirm a vector is a real or double vector with the function "is.double", e.g. is.double(c(-3, 0, 2, 5, 6))
Data types: String/Character
• Composed of alphabetical letters and words/text
• Denoted by single or double quotation marks
• R has a special built-in vector of the alphabetical letters: letters
• Examples: c("a", "b", "c"), letters, c('cats', 'and', 'dogs')
• Can check whether a vector is a character vector with the function is.character, e.g. is.character(letters)
Data type: Factors
• In R a factor vector is a categorical variable with a discrete classification (grouping)
• Example
cat <- factor(c(rep("Y", 28), rep("N", 10)))
is.factor(cat)
[1] TRUE
levels(cat)
[1] "N" "Y"
Data type: Complex
• These are vectors with real and imaginary parts. Imaginary numbers are denoted by the letter "i"
• Mathematically, complex numbers make it possible to take the square root of negative values
# Example, complex vector
3+2i
[1] 3+2i
# Confirm it's complex
is.complex(3+2i)
[1] TRUE
Data type: Raw
• These are vectors containing raw computer bytes, i.e. information as it is held in data storage units
• More computer language (0's and 1's) than human-readable language
• Integers and doubles are jointly referred to as numeric
• The most commonly used data types are logical, numeric and character. Complex and raw data types are rarely used
int <- c(-3L, -2L, -1L, 0L, 1L, 2L, 3L)
is.integer(int)
[1] TRUE
is.numeric(int)
[1] TRUE
doub <- c(-3, -2, -1, 0, 1, 2, 3)
is.double(doub)
[1] TRUE
is.numeric(doub)
[1] TRUE
Data structures
• There are two broad types of data structures in R
– Atomic vectors
– Generic (list) vectors
• These structures have three properties
– Type
– Length and
– Attributes
• The function "typeof" is used to establish a vector's type, the function "length" is used to determine its length, and the function "attributes" is used to get additional information about a vector
• Atomic vectors and lists differ in their type: atomic vectors can only contain one data type, while lists can contain any number of data types.
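A short sketch of the three properties on an invented named vector:

```r
num <- c(a = 1, b = 5, c = 6)

typeof(num)       # "double": the vector's type
length(num)       # 3: the number of elements
attributes(num)   # additional information; here a $names attribute "a" "b" "c"
```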
Atomic Vectors
• Contain only one data type; they include 1-dimensional atomic vectors, 2-dimensional atomic vectors called "matrices", and multi-dimensional atomic vectors called "arrays".
• Dimensionality can be thought of as the number of indices required to address any element of a vector, e.g. the vector "cat" requires one index to address any value; for example, index "4" means the fourth value, which is Y
• Single variables are all atomic vectors of one dimension
• To check whether a vector is atomic or a list, use is.atomic() or is.list(). Note there is also an is.vector(), but this checks that an object has no attributes other than names
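A small sketch contrasting the two checks (the vectors are invented for illustration):

```r
atom <- c(1, 5, 6, 8)        # one data type only
lst  <- list(1, "a", TRUE)   # mixed data types

is.atomic(atom)   # TRUE
is.list(atom)     # FALSE
is.atomic(lst)    # FALSE
is.list(lst)      # TRUE
```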
Atomic vectors: Matrices
• Two-dimensional atomic vectors; they contain data of the same type
• Any atomic vector can be converted to a matrix by adding a dim attribute
cat <- c(rep("Y", 28), rep("N", 10))
typeof(cat)
[1] "character"
dim(cat)
NULL
is.matrix(cat)
[1] FALSE
dim(cat) <- c(19, 2)
typeof(cat)
[1] "character"
dim(cat)
[1] 19 2
is.matrix(cat)
[1] TRUE
• Other than using "dim()" to convert a one-dimensional to a multi-dimensional atomic vector, matrices can be created with "matrix()", or by coercing another data object with "as.matrix()"
typeof(airmiles)
[1] "double"
airmiles2 <- matrix(airmiles, nrow = 8, ncol = 3)
is.matrix(airmiles2)
[1] TRUE
airmiles3 <- as.matrix(airmiles, nrow = 8, ncol = 3)
is.matrix(airmiles3)
[1] TRUE
(Note: as.matrix() ignores the nrow and ncol arguments here, so airmiles3 is a one-column matrix.)
rm(airmiles2, airmiles3)
Special 1 & 2 dimension atomic vectors
Time series objects
• These are vectors used to store observations collected at given time points (intervals) over a period of time, e.g. observations collected every three months for five years.
• The distinguishing feature of this data is time; the interval is usually constant, like three months (regular), but in other cases it might not be (irregular)
• In R, time series data are numeric vectors with the attribute class equal to "ts", meaning time series
• Time series vectors can be either a 1-dimensional atomic vector, like the "AirPassengers" data set in R, or a 2-dimensional matrix, like "EuStockMarkets"
typeof(AirPassengers)
[1] "double"
attr(AirPassengers, "class")
[1] "ts"
typeof(EuStockMarkets)
[1] "double"
attr(EuStockMarkets, "class")
[1] "mts" "ts" "matrix"
Atomic vectors: Arrays
• Arrays are multi-dimensional atomic vectors.
• Matrices are two-dimensional arrays.
• They are rarely used, but it's good to know they exist
• Created like matrices: with "dim()", e.g. dim(a) <- c(6, 2, 2), or with array() or as.array()
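A minimal sketch of creating a 3-dimensional array with `dim()` (the values are invented):

```r
a <- 1:24
dim(a) <- c(6, 2, 2)   # now a 6 x 2 x 2 array; array(1:24, dim = c(6, 2, 2)) is equivalent

is.array(a)   # TRUE
a[4, 1, 2]    # three indices address one element: 16
```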
Data structures: Generic vectors
• Lists are data structures which can contain more than one data type.
• There are two types of lists: two-dimensional lists called "data frames", and plain "lists"
Data frames
• The most recognizable data structure
• A core data structure in R
• Presents data in rows and columns like matrices, but in this case columns can have different data types
# Example
head(faithful)
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
Generic vectors: Lists
• These are a unique data structure
• Can contain any number and type of objects, not just data. Can contain sub-lists, hence also called recursive
• Created with the function "list()". Can also coerce other structures to a list with the function "as.list()"
• We will create this structure in our next session
Importing and Exporting Data in R
• Data importation is also referred to as "reading in" data
• Reading data depends on the type and location of the file
• In this sub-session we are interested in reading in local R, text, Excel, database and other statistical program files
• We also discuss web scraping
Reading in .RData
• Data created in R can be stored in an .RData file
• This could be any data structure, or a collection of data saved from an active working directory (workspace)
• The function "save.image()" is used to store the workspace; the function "load" is used to read in any ".RData" file (history is read back with "loadhistory()")
# See current objects
ls()
[1] "cat" "doub" "int"
# Store in an external .RData file
save.image()
# Remove all object from workspace/global environment
rm(list = ls())
ls()
character(0)
# Read in .RData
load(".RData")
# Check we have them back
ls()
[1] "cat" "doub" "int"
R’s core importing function “read.table()”
• read.table is R's core importing function
• Almost all other reading functions, including those in contributed packages, depend on this function
• It reads a file and creates a data frame from it
• It has a number of wrapper functions (functions which provide a convenience interface to another function, e.g. by supplying pre-defined/default values; this makes function calls more efficient)
8
• Wrapper functions include read.csv(), read.csv2(), read.delim() and read.delim2()
• CSV files are comma-separated files
• "Delim" files are text files; "delim" means delimited, which implies how the data are separated, e.g. with tabs
• Both csv and delim files are relatively easy to read into R as long as the separators/delimiters are known
• If the separator or delimiter is not known and the file cannot be opened, it is best to read in a few lines with the readLines() function
Live demo (reading in a CSV file)
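A self-contained sketch of the workflow described above, using a temporary file so it runs anywhere (the file contents are invented for illustration):

```r
# Write a small CSV to a temporary file, then read it back
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,score", "alice,70", "bob,85"), tmp)

# read.csv() is a wrapper around read.table() with sep = "," and header = TRUE
dat <- read.csv(tmp)
dat           # a data frame with columns "name" and "score"

# Peek at a file of unknown format before committing to a reader
readLines(tmp, n = 2)
```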
Reading in Excel files
• Base R does not have a function to read in Excel-based files
• But many contributed packages have functions to read them in
• The core reference for importing this type of file is one of the R-project manuals, R Data Import/Export, specifically chapter 9.
• The recommendation made there is to try to convert the Excel file into a ".csv" (comma-separated) or "delim" (tab-separated) file.
Live demo (reading an Excel file)
Reading in Databases data
• A word of caution: database data tend to be large, and R is not too good when it comes to large data, so read in part of the data or look for ways to increase the memory available to R processes, for example by using the cloud.
• Most Relational Database Management Systems (RDBMS) hold data similar to R's data frames, where columns are called "fields" and rows are called "records".
• Extracting part of a relational database requires use of database querying semantics, the core of which is the SELECT statement.
• In general, a SELECT query uses:
– FROM to select the table
– WHERE to specify a condition for inclusion and
– ORDER BY to sort results (this is important, as an RDBMS does not order its rows like R's data frames)
• There are a number of contributed packages on CRAN for reading RDBMS data; these include RMySQL, DBI, ROracle, RPostgreSQL and RSQLite.
Live demo (reading in RDBMS and web data)
From other statistical software
• Other statistical packages whose data files are often read into R are SPSS, SAS, Stata and EpiInfo
• As with Excel and database data, a package must be used to read in these files
• The recommended package is "foreign"; other packages include "readstata13" and haven.
Live demo (reading SPSS and Stata data files)