Kevin Swingler: Introduction to Data Mining
1. Data Mining Methodology
Kevin Swingler
University of Stirling
Lecturer, Computing Science
kms@cs.stir.ac.uk
2. What is Data Mining?
• Generally, methods of using large quantities of data and appropriate algorithms to allow a computer to ‘learn’ to perform a task
• Task oriented:
– Predict outcomes or forecast the future
– Classify objects as belonging to one of several categories
– Separate data into clusters of similar objects
• Most methods produce a model of the data that performs the task
3. Some Examples
• Predicting patterns of drug side-effects
• Spotting credit card or insurance fraud
• Controlling complex machinery
• Predicting the outcome of medical interventions
• Predicting the price of stocks and shares or exchange rates
• Knowing when a cow is most fertile (really!)
4. Examples in LIS
• Text Mining
– Automatically determine what an article is ‘about’
– Classify attitudes in social media
• Demand Prediction
– Predicting demand for resources such as new books, journals or buildings
• Search and Recommend
– Analysis of borrowing history to make recommendations
– Link analysis for citation clustering
5. Data Sources
• In House – Data you own
– Borrowing records
– Search histories
– Catalogue data
• Bought in
– Demographic data about customers
– Demographic data about the locality around a library
6. Methods
• Techniques for data mining are based on mathematics and statistics, but are implemented in easy-to-use software packages
• Where methodology matters is in pre-processing the data, choosing the techniques, and interpreting the results
8. Data Preparation
• Clean the data
– Remove rows with missing values
– Remove rows with obvious data entry errors – e.g. Age = 200
– Recode obvious data entry inconsistencies – e.g. if Gender = M or F, but occasionally Male
– Remove rows with minority values
– Select which variables to use in the model
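As a sketch of how these steps look in practice, here is a pandas version; the file name, column names and the 30-row cutoff are hypothetical illustrations, not from the slides:

    import pandas as pd

    # Hypothetical borrowing dataset with Age, Gender and BorrowCount columns
    df = pd.read_csv("borrow_records.csv")

    # Remove rows with missing values
    df = df.dropna()

    # Remove rows with obvious data entry errors, e.g. Age = 200
    df = df[(df["Age"] >= 0) & (df["Age"] <= 110)]

    # Recode obvious inconsistencies: Gender should be M or F
    df["Gender"] = df["Gender"].replace({"Male": "M", "Female": "F"})

    # Remove rows with minority values (categories seen fewer than 30 times)
    counts = df["Gender"].value_counts()
    df = df[df["Gender"].isin(counts[counts >= 30].index)]

    # Select which variables to use in the model
    df = df[["Age", "Gender", "BorrowCount"]]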
9. Data Quantity
• Choose the variables to be used for the model
• Look at the distributions of the chosen values
• Look at the level of noise in the data
• Look at the degree of linearity in the data
• Decide whether or not there are sufficient examples in the data
• Treat unbalanced data
10. Consider Error Costs
• Imagine a system that classifies input patterns into one of several possible categories
• Sometimes it will get things wrong; how often depends on the problem:
– Direct mail targeting – very often
– Credit risk assessment – quite often
– Medical reasoning – very infrequently
11. Error Costs
• An error in one direction can cost more than an error in the opposite direction
– Recommending a blood test based on a false positive is better than missing an infection due to a false negative
– Missing a case of insurance fraud is more costly than flagging a claim to be double checked
• The balance of examples in each case can be manipulated to reflect the cost
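One way to express unequal costs without resampling is a per-class weight. A minimal scikit-learn sketch, assuming a toy fraud-style dataset where missing a positive costs ten times more than a false alarm:

    from sklearn.tree import DecisionTreeClassifier

    # Toy data: y == 1 is the rare, expensive-to-miss class
    X = [[0.1], [0.2], [0.15], [0.9], [1.0], [0.95]]
    y = [0, 0, 0, 1, 1, 1]

    # Penalise missing class 1 ten times more than a false alarm on class 0
    model = DecisionTreeClassifier(class_weight={0: 1, 1: 10}).fit(X, y)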
12. Check Points
• Data quantity and quality: do you have sufficient good data for the task?
– How many variables are there?
– How complex is the task?
– Is the data’s distribution appropriate?
• Outliers
• Balance
• Value set size
13. Distributions
• A frequency distribution is a count of how often each variable contains each value in a data set
• For discrete numbers and categorical values, this is simply a count of each value
• For continuous numbers, the count is of how many values fall into each of a set of sub-ranges
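In pandas (reusing the hypothetical df from the data preparation sketch), both kinds of count are one-liners:

    # Discrete/categorical variable: a count of each value
    print(df["Gender"].value_counts())

    # Continuous variable: count how many values fall into each of 10 sub-ranges
    print(pd.cut(df["Age"], bins=10).value_counts().sort_index())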
15. Features of a Distribution to Look For
• Outliers
• Minority values
• Data Balance
• Data entry errors
16. Outliers
• A small number of values that are much larger or much smaller than all the others
• Can disrupt the data mining process and give misleading results
• You should either remove them or, if they are important, collect more data to reflect this aspect of the world you are modelling
• Could be data entry errors
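A common screen for such values, not prescribed by the slides, flags anything more than 1.5 interquartile ranges beyond the quartiles (again reusing the hypothetical df):

    # 1.5 * IQR rule: flag values far outside the bulk of the distribution
    q1, q3 = df["Age"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["Age"] < q1 - 1.5 * iqr) | (df["Age"] > q3 + 1.5 * iqr)]
    print(outliers)  # inspect before removing: could be errors or real extremes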
17. Minority Values
• Values that only appear infrequently in the data
• Do they appear often enough to contribute to the model?
• Might be worth removing them from the data or collecting more data where they are represented
• Are they needed in the finished system?
• Could they be the result of data entry errors?
18. Minority Values
[Bar chart: frequency counts of the values Male, Female, M and F in a gender variable; y-axis from 0 to 600]
What does this chart tell you about the gender variable in a data set?
What should you do before modelling or mining the data?
19. Flat and Wide Variables
• Variables where all the values are minority values have a flat, wide distribution – one or two of each possible value
• Such variables are of little use in data mining because the goal of DM is to find general patterns from specific data
• No such patterns can exist if each data point is completely different
• Such variables should be excluded from a model
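A quick way to spot flat, wide variables is to compare each column's number of distinct values with the number of rows; the 90% cutoff here is an illustrative choice, not a rule from the slides:

    # Drop columns where nearly every row has its own unique value
    n_rows = len(df)
    flat_wide = [c for c in df.columns if df[c].nunique() > 0.9 * n_rows]
    df = df.drop(columns=flat_wide)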
20. Data Balance
• Imagine I want to predict whether or not a prospective customer will respond to a mailing campaign
• I collect the data, put it into a data mining algorithm, which learns and reports a success rate of 98%
• Sounds good, but when I put a new set of prospects through to see who to mail, what happens?
21. A Problem
• … the system predicts ‘No’ for every single prospect.
• With a response rate on a campaign of 2%, the system is right 98% of the time if it always says ‘No’.
• So it never chooses anybody to target in the campaign
22. A Solution
• One data pre-processing solution is to balance the number of examples of each target class in the output variable
• In our previous example: 50% customers and 50% non-customers
• That way, any gain in accuracy over 50% would certainly be due to patterns in the data, not the prior distribution
• This is not always easy to achieve – you might need to throw away a lot of data to balance the examples, or build several models on balanced subsets
• Not always necessary – if an event is rare because its cause is rare, then the problem won’t arise
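A sketch of balancing by undersampling the majority class with pandas, assuming a hypothetical binary Responded column in df:

    # Undersample the majority class to match the minority class
    responders = df[df["Responded"] == 1]
    non_responders = df[df["Responded"] == 0].sample(n=len(responders),
                                                     random_state=0)
    balanced = pd.concat([responders, non_responders])
    balanced = balanced.sample(frac=1, random_state=0)  # shuffle the rows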
23. Data Quantity
• How much data do you need?
• How long is a piece of string?
• Data must be sufficient to:
– Represent the dynamics of the system to be modelled
– Cover all situations likely to be encountered when predictions are needed
– Compensate for any noise in the data
24. Model Building
• Choose a number of techniques suitable to the task:
– Neural network for prediction or classification
– Decision tree for classification
– Rule induction for classification
– Bayesian network for classification
– K-Means for clustering
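In scikit-learn terms, a candidate set might be assembled like this; note scikit-learn has no Bayesian network or rule induction estimator, so naive Bayes stands in for the Bayesian option here:

    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.cluster import KMeans

    candidates = {
        "neural network": MLPClassifier(max_iter=1000),
        "decision tree": DecisionTreeClassifier(),
        "naive Bayes": GaussianNB(),
    }
    clusterer = KMeans(n_clusters=5, n_init=10)  # for the clustering task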
25. Train Models
• For each technique:
– Run a series of experiments with different parameters
– Each experiment should use around 70% of the data for training and the rest for testing
– When a good solution is found, use cross-validation (10-fold is a good choice) to verify the result
26. Cross Validation
• Split the data into ten subsets, then train 10 models – each one using 9 of the 10 subsets as training data and the 10th as test. The score is the average of all 10.
• This is a more accurate representation of how well the data may be modelled, as it reduces the risk of getting a lucky test set
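Both steps look like this in scikit-learn; the built-in breast cancer dataset and the tree depth are placeholders for your own data and model:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Around 70% of the data for training, the rest for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
                                                        random_state=0)
    model = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))

    # 10-fold cross-validation: the score is the average over the 10 folds
    scores = cross_val_score(DecisionTreeClassifier(max_depth=4), X, y, cv=10)
    print("10-fold mean accuracy:", scores.mean())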
27. Assess Models
• You can measure the success of your model in a number of ways
– Mean squared error – not always meaningful
– Percentage correct for classification
– Confusion matrix for classification

                    Output: True   Output: False
    Actual True          80              30
    Actual False         20              90
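scikit-learn computes such a matrix directly (reusing model, X_test and y_test from the cross-validation sketch); note its convention is rows = actual class, columns = predicted class:

    from sklearn.metrics import confusion_matrix

    print(confusion_matrix(y_test, model.predict(X_test)))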
28. Probability Outputs
• Most classification techniques provide a score with the classification – either a probability or some other measure
• This can be used to:
– Allow an answer of “unsure” for cases where no single class has a high enough probability
– Weight outputs to allow for unequal costs of outcomes
– Produce lift charts and ROC curves
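A sketch of the “unsure” rule, reusing the fitted model and X_test from the earlier sketch; the 0.8 threshold is an illustrative choice:

    import numpy as np

    proba = model.predict_proba(X_test)          # one probability per class
    best = proba.argmax(axis=1)
    labels = np.where(proba.max(axis=1) >= 0.8,  # confident enough?
                      model.classes_[best],
                      -1)                        # -1 stands for "unsure"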
29. Generalisation and Over Fitting
• Most data mining models have a degree of complexity that can be controlled by the designer
• The goal is to find the degree of complexity that is best suited to the data
• A model that is too simple over-generalises
• A model that is too complex over-fits
• Both have an adverse effect on performance
30. Gen-Spec Trade Off
• Adding to the complexity of the model fits the training data better, at the expense of higher test error
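The trade-off is easy to see by sweeping a complexity parameter and comparing training and test scores (reusing the split from the earlier sketch; the depth values are illustrative):

    # Deeper trees fit the training data better but eventually over-fit
    for depth in (1, 2, 4, 8, 16, None):
        tree = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
        print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))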
31. Repeat or Finish
• The result of the data mining will leave you with either a model that works or the need to improve
• More data may need to be collected
• Different variables might be tried
• The process can loop several times before a satisfactory answer is found
32. Understanding and Using the Results
• The resulting model can perform the task it was set, so it can be embedded in an automated system
• Some techniques produce models that are human readable and allow insights into the structure of the data
• Others are almost impossible to extract knowledge from