This document provides an introduction to data mining. It discusses the growth of data and need for data mining techniques. It covers example applications like customer relationship management and tax fraud detection. The document also discusses the differences between supervised and unsupervised learning, as well as related fields like statistics, machine learning, databases and knowledge discovery. It provides an overview of the CRISP-DM process for knowledge discovery and data mining.
3. /faculteit technologie management
Why Data Mining
• Cascade of data
– Growth rates differ, but roughly 30% per year is a conservative estimate
• The possibility to use computers to analyze data
– In 1975 a single mainframe with 1 MB of working memory served the whole university; now an ordinary PC has 512 MB of working memory
4. Cascade of data
• Business and government systems (transaction systems, ERP systems, workflow systems, ...)
• Scientific data: astronomy, biology, etc.
• Web, text, and e-commerce (a source of new regularities)
• Hospitals, internal revenue service
• ...
5. Examples of large databases
• AT&T handles billions of calls per day
– so much data that it cannot all be stored; analysis has to be done "on the fly"
• Europe's Very Long Baseline Interferometry (VLBI) network has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
• Google
6. First conclusion
• Very little data will ever be looked at by a human
• Data mining algorithms and computers are NEEDED to make sense and use of data.
8. Application examples I
• Customer Relationship Management (CRM)
– Based on a database with client information and behavior, try to select other potential consumers of a product.
– Euro miles.
• Profiling tax cheaters
– Based on the profile of the taxpayer and some figures from the (electronic) tax form, try to predict tax cheating.
9. Application examples II
• Health care
– Given the patient profile and the diagnoses, try to predict the number of hospital days. The information is used in a planning system.
• Industry
– Job shop planning: based on already accepted jobs, try to predict the delivery time of a newly offered job.
10. Types of applications
• Classification (supervised)
– Credit risk: the result of data mining is a set of rules that can be used to classify new clients as high, normal, or low risk
• Estimation (supervised)
– Credit risk: the output is not a classification but a number between -1 and 1 indicating risk (-1.0 very low, 0.0 normal, +1.0 very high)
• Clustering (unsupervised)
• Associations: e.g. beer, chips, and peanuts frequently occur together on one person's shopping list
• Visualization: to facilitate human discovery
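The association idea above can be sketched with simple co-occurrence counting over shopping baskets. This is a minimal illustration, not the algorithm used in practice; the basket data and the support threshold are assumptions made for the example:

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets (illustrative data)
baskets = [
    {"beer", "chips", "peanuts"},
    {"beer", "chips"},
    {"beer", "chips", "peanuts", "milk"},
    {"milk", "bread"},
]

# Count how often each pair of items occurs together (its "support")
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs occurring in at least half of the baskets count as frequent
min_support = len(baskets) / 2
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent)  # beer & chips co-occur in 3 of 4 baskets
```

Real association-rule miners (e.g. Apriori) apply the same support idea to itemsets of any size, pruning candidates that contain an infrequent subset.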
11. Supervised versus unsupervised
• Supervised (credit risk)
– The starting point is a historical database with client information and financial data, including credit history (the classification). This database is used to induce credit-risk rules.
• Unsupervised (clustering)
– Try to cluster customers into similar groups (how many groups? similar in which sense?)
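The supervised case can be sketched in a few lines: induce a rule from labeled historical data, then apply it to new clients. The single feature, the data, and the threshold rule are all illustrative assumptions, far simpler than a real rule inducer:

```python
# Toy supervised learning: induce a credit-risk rule from labeled history.
history = [
    # (debt-to-income ratio, observed risk label)
    (0.10, "low"), (0.15, "low"), (0.40, "high"), (0.55, "high"),
]

# Induce a threshold midway between the riskiest "low" and safest "high" case
low_max = max(ratio for ratio, label in history if label == "low")
high_min = min(ratio for ratio, label in history if label == "high")
threshold = (low_max + high_min) / 2

def classify(ratio):
    """Apply the induced rule to a new client."""
    return "high" if ratio > threshold else "low"

print(classify(0.50))  # classify a new client
```

The key point is that the labels in the historical database drive the induction; in the unsupervised case no such labels exist, and the algorithm must invent the grouping itself.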
12. E-commerce – Case study
• A person buys a book (product) at Amazon.com.
• Task: recommend other books (products) this person is likely to buy
• Amazon does clustering based on books bought:
– customers who bought "Advances in Knowledge Discovery and Data Mining" also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations"
• The recommendation program is quite successful
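The "customers who bought X also bought Y" pattern can be sketched as co-purchase counting. This is a minimal sketch under assumed data, not Amazon's actual method; the purchase histories and shortened titles are invented for the example:

```python
from collections import Counter

# Hypothetical purchase histories (illustrative data)
purchases = {
    "alice": {"Advances in KDD", "Data Mining with Java"},
    "bob":   {"Advances in KDD", "Data Mining with Java", "SQL Basics"},
    "carol": {"Advances in KDD", "Statistics 101"},
}

def recommend(book, purchases):
    """Rank other books by how often they were co-purchased with `book`."""
    counts = Counter()
    for basket in purchases.values():
        if book in basket:
            counts.update(basket - {book})
    return [other for other, _ in counts.most_common()]

print(recommend("Advances in KDD", purchases))
```

Production recommenders add normalization (popular items co-occur with everything) and collaborative filtering over user similarity, but the co-occurrence signal above is the starting point.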
13. Hands-on project I
• Historical consumer data
– Age, education, sex, relationship, etc.
– Income
• Model to predict income above 50K
• Use the model to select consumers for direct mailing
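The final selection step can be sketched as filtering the consumer list through the trained model. Here a hand-written rule stands in for the induced model; the field names, weights, and consumer records below are all illustrative assumptions:

```python
# Hypothetical consumer records (illustrative data)
consumers = [
    {"name": "A", "age": 45, "education_years": 16},
    {"name": "B", "age": 22, "education_years": 10},
    {"name": "C", "age": 50, "education_years": 18},
]

def predicted_income_above_50k(consumer):
    # Stand-in for the induced model: older and highly educated
    # consumers are scored as likely earners above 50K
    return consumer["age"] >= 40 and consumer["education_years"] >= 14

# Keep only consumers the model predicts to earn above 50K
mailing_list = [c["name"] for c in consumers if predicted_income_above_50k(c)]
print(mailing_list)  # consumers selected for the direct mailing
```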
14. Problems suitable for data mining
• have sub-optimal current methods
• have accessible, sufficient, and relevant data
• provide a high payoff for the right decisions!
• (have a changing environment)
16. Knowledge discovery definition
Knowledge Discovery in Databases is the non-trivial process of identifying
– valid
– novel
– potentially useful
– and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
18. Statistics, Machine Learning and Data Mining
• Statistics:
– more theory-based
– more focused on testing hypotheses
• Machine Learning
– more heuristic than theory-based
– focused on improving the performance of a learning algorithm
• Data Mining and Knowledge Discovery
– Data Mining is one step in the Knowledge Discovery process (applying the machine learning algorithm)
– Knowledge Discovery is the whole process, including data cleaning, learning, and the integration and visualization of results
• The distinctions are fuzzy
19. Knowledge discovery process flow, according to CRISP-DM
• Business Understanding + Data Understanding + Data Preparation: 80% of the time
• Modeling (applying the mining algorithm): 20%
• Monitoring
20. CRISP-DM phases and tasks (with outputs)
• Business Understanding
– Determine Business Objectives: Background; Business Objectives; Business Success Criteria
– Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
– Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
– Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
• Data Understanding
– Collect Initial Data: Initial Data Collection Report
– Describe Data: Data Description Report
– Explore Data: Data Exploration Report
– Verify Data Quality: Data Quality Report
• Data Preparation (Data Set; Data Set Description)
– Select Data: Rationale for Inclusion / Exclusion
– Clean Data: Data Cleaning Report
– Construct Data: Derived Attributes; Generated Records
– Integrate Data: Merged Data
– Format Data: Reformatted Data
• Modeling
– Select Modeling Technique: Modeling Technique; Modeling Assumptions
– Generate Test Design: Test Design
– Build Model: Parameter Settings; Models; Model Description
– Assess Model: Model Assessment; Revised Parameter Settings
• Evaluation
– Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
– Review Process: Review of Process
– Determine Next Steps: List of Possible Actions; Decision
• Deployment
– Plan Deployment: Deployment Plan
– Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
– Produce Final Report: Final Report; Final Presentation
– Review Project: Experience Documentation
21. Other related fields
• Data warehouse
– A data warehouse does not simply contain data accumulated at a central point; the data is carefully assembled from a variety of information sources around the organization, cleaned up, quality assured, and then released (published).
• Business Intelligence (BI)
– The use of data in the data warehouse to support managers with important information
23. Data Mining versus Process Mining
• Process Mining is data mining, but with a strong business-process view.
• Some of the more traditional data mining techniques can be used in the context of process mining.
• Some new techniques have been developed to perform process mining (mining of process models).
24. Why Process Mining
• Traditional As-Is analysis of business processes is strongly based on the opinions of process experts. The basic idea is to assemble an appropriate team and to organize modeling sessions in which the knowledge of the team members is used to build an adequate As-Is process model.
• The added value of process mining in the As-Is analysis:
– information based on the real performance of the process (objective)
– more details