2. About me
• Education
• NCU (MIS)、NCCU (CS)
• Work Experience
• Telecom big data Innovation
• AI projects
• Retail marketing technology
• User Group
• TW Spark User Group
• TW Hadoop User Group
• Taiwan Data Engineer Association Director
• Research
• Big Data / ML / AIoT / AI columnist
5. Basic functions
• Data pre-processing
• Set the ID column as primary key
• Join the salary column on primary key (Using VLOOKUP)
• Convert Height to Height_cm
=IFERROR(VLOOKUP(A2,salary!$A$2:$D$452,4,FALSE),0)
=MID(G2,2,1)*30.48 + MID(G2,4,1)*2.54
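Reading the two formulas above: the first looks up each row's ID (column A) in the salary sheet and returns the value from its fourth column, falling back to 0 when there is no match; the second extracts two single characters from the Height text in column G, treats them as feet and inches, and converts them to centimetres (1 ft = 30.48 cm, 1 in = 2.54 cm). The exact character positions depend on how Height is stored; as a hypothetical variant, if Height were plain text such as 6-2 with the feet digit first, the conversion would be:
=LEFT(G2,1)*30.48 + MID(G2,3,1)*2.54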
6. Basic functions
• Freeze the first row
• Conditional Formatting
• Age: color cells below the average Age in RED
• Salary: color cells above the average Salary in GREEN
• Position: color cells whose Position is C in YELLOW
• Sorting by Age or Salary for data investigation
• Sorting by Salary or Age
• Sorting by cell color
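The three highlighting rules above can be entered as formula-based conditional-formatting rules (Home > Conditional Formatting > New Rule > Use a formula to determine which cells to format). The column letters and row range below are assumptions for illustration only, with Age in column F, Salary in column K, Position in column E, and data in rows 2 to 449:
• Age below average (RED): =F2<AVERAGE($F$2:$F$449)
• Salary above average (GREEN): =K2>AVERAGE($K$2:$K$449)
• Position equals C (YELLOW): =$E2="C"
Each rule is applied to its own data column, so the relative row reference (row 2) moves down the range while the averaged range stays fixed.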
7. Basic functions
• Delete the rows where the Salary value equals zero
• Delete them manually
• Pivot table with Team, Position, Age, Salary
• Set the Age field to Average
• Sort by Salary
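As a quick cross-check of the pivot's average Age per Team, the same figure can be computed with AVERAGEIF; the ranges here are assumptions for illustration, with Team in column B, Age in column F, and a team name in cell M2:
=AVERAGEIF($B$2:$B$449,M2,$F$2:$F$449)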
8. Basic functions
• Binning by Age
• Find the Max, Min, Average, Range, Scale, Bin_interval, and Dev of Age
• Set up each Age bin
• Count the frequency of each bin
• First select the whole output range, then enter the formula in the first cell
• Press Ctrl + Shift + Enter to enter it as an array formula (shown with { })
=FREQUENCY(F$2:F$449,P$3:P$8)
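A possible way to derive the bin boundaries that FREQUENCY uses above, assuming Age sits in F2:F449 and six bins are wanted (the ranges and bin count are assumptions for illustration):
• Max: =MAX(F$2:F$449)
• Min: =MIN(F$2:F$449)
• Range: =MAX(F$2:F$449)-MIN(F$2:F$449)
• Bin interval: =(MAX(F$2:F$449)-MIN(F$2:F$449))/6
The bin upper bounds in P3:P8 are then Min plus one interval, Min plus two intervals, and so on. FREQUENCY returns one count per bound (values greater than the previous bound and up to the current one); if the selected output range has one extra cell, that last cell counts the values above the highest bound.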
9. Basic functions
• Find the empty rate of the College column
• Select the part of the College column that contains the blank cells
• Click the Find & Select button, then Go To Special
• Choose the Blanks option
• Type BLANK in the formula bar and press Ctrl + Enter to fill all selected cells
• Get the distinct College values by using advanced filtering
• Count the frequency of each distinct College value and compute its ratio
• Sort by the College ratio column
=COUNTIF(H$2:H$449,M2)
=N2/SUM(N$2:N$118)
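The first formula above counts how many times the distinct College value in M2 appears in the College column H2:H449, and the second divides each count in column N by the total of all counts to get its ratio. For the empty-rate step itself, the blanks can also be counted directly before they are filled with the BLANK placeholder:
=COUNTBLANK(H$2:H$449)/ROWS(H$2:H$449)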
10. Static chart
• There are generally three steps in drawing a chart: observe the data, determine the relationship, and select the chart.
• Clarify what type of data it is and what content you want to express.
• Once the content to be expressed is clear, choose the chart type that expresses it best.
11. Pie chart
• You must have some kind of whole amount that is divided into a number of distinct parts.
• Your primary objective in a pie chart should be to compare each group’s contribution to the whole.
12. Line chart
• Line charts provide the clearest graphical representation of time-related variables and are the preferred mode for representing trends or variables over time.
13. Histogram chart
• It is used to summarize discrete or continuous data that are measured on an interval scale.
• It is often used to illustrate the major features of the distribution of the data in a convenient form.
14. Bar chart
• It provides a way of showing data values as bars so that multiple data sets can be compared side by side.
15. Differences between histogram and bar chart
• Usage: a bar chart compares different categories of data; a histogram displays the distribution of a variable.
• Type of variable: a bar chart uses categorical variables; a histogram uses numeric variables.
• Rendering: in a bar chart, each data point is rendered as a separate bar; in a histogram, the data points are grouped and rendered based on the bin value, with the entire range of data values divided into a series of non-overlapping intervals.
• Space between bars: a bar chart can have space between bars; a histogram has none.
• Reordering bars: bar-chart bars can be reordered; histogram bins cannot.
16. Scatter plot
• It uses dots to represent values for two different numeric variables and to observe relationships between them.
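A numeric companion to the scatter plot is the correlation coefficient between the two plotted variables; in Excel this is a single formula, where the two ranges below are hypothetical placeholders for the plotted columns:
=CORREL(F2:F449,K2:K449)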
17. Bubble chart
• Bubble Charts are typically used to compare and show the relationships between categorized circles, by the use of positioning and proportions.
• The overall picture of Bubble Charts can be used to analyze for patterns/correlations.
18. Box plot
• Q1: the first quartile (25%) position.
• Q3: the third quartile (75%) position.
• Interquartile range (IQR): Q3 − Q1.
• Lower and upper 1.5*IQR whiskers: these represent the limits and boundaries for the outliers.
• Outliers: defined as observations that fall below Q1 − 1.5 IQR or above Q3 + 1.5 IQR.
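In Excel, these box-plot statistics can be computed directly; the range F2:F449 below is an assumed Age column used only for illustration:
• Q1: =QUARTILE.INC(F$2:F$449,1)
• Q3: =QUARTILE.INC(F$2:F$449,3)
• IQR: =QUARTILE.INC(F$2:F$449,3)-QUARTILE.INC(F$2:F$449,1)
The lower and upper fences are then Q1 − 1.5*IQR and Q3 + 1.5*IQR, referencing the cells that hold the three formulas above.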
19. Tree map
• It displays different data in different colored blocks, and the size of each block shows how the values compare: the larger the block, the larger the value of the data it represents.
20. Homework
Find a sample data source, try to pre-process it in Excel, and visualize it.