Team Foundation Server - Tracking & Reporting (Steve Lange)
Comprehensive presentation detailing reporting and tracking capabilities of Team Foundation Server. Focuses on Excel workbooks and Reporting Services, but touches on other technologies as well.
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can... (Big Data Spain)
In an era of growing data complexity and volume and the advent of Big Data, feature selection has a key role to play in helping reduce high-dimensionality in machine learning problems.
https://www.bigdataspain.org/2017/talk/feature-selection-for-big-data-advances-and-challenges
Big Data Spain 2017
November 16th - 17th, Kinépolis Madrid
IaaS Cloud Benchmarking: Approaches, Challenges, and Experience (Alexandru Iosup)
Impact Award lecture at MTAGS/ACM SC'12. URL: http://datasys.cs.iit.edu/events/MTAGS12/biggest-impact-award.html
PyData 2015 Keynote: "A Systems View of Machine Learning" (Joshua Bloom)
Despite the growing abundance of powerful tools, building and deploying machine-learning frameworks into production continues to be a major challenge, in both science and industry. I'll present some particular pain points and cautions for practitioners as well as recent work addressing some of the nagging issues. I advocate for a systems view, which, when expanded beyond the algorithms and code to the organizational ecosystem, places some interesting constraints on the teams tasked with development and stewardship of ML products.
About: Dr. Joshua Bloom is an astronomy professor at the University of California, Berkeley where he teaches high-energy astrophysics and Python for data scientists. He has published over 250 refereed articles, largely on time-domain transient events and telescope/insight automation. His book on gamma-ray bursts, a technical introduction for physical scientists, was published recently by Princeton University Press. He is also co-founder and CTO of wise.io, a startup based in Berkeley. Josh has been awarded the Pierce Prize from the American Astronomical Society; he is also a former Sloan Fellow, Junior Fellow at the Harvard Society of Fellows, and Hertz Foundation Fellow. He holds a PhD from Caltech and degrees from Harvard and Cambridge University.
Balancing the Pendulum: Reflecting on BDD in Practice (Zach Dennis)
Here are the slides that I used for my talk on "Balancing the pendulum: Reflecting on BDD in Practice" at the Great Lakes Ruby Bash, April 17th, 2010 in Lansing, MI.
This talk was aimed at sharing my reflections and experiences on how BDD and outside-in development have been working over the past few years.
While Hadoop is the dominant "Big Data" tool suite today, it's a first-generation technology. I discuss its strengths and weaknesses, then look at how we "should" be doing Big Data and currently-available alternative tools.
Domain Engineering for Applied Monocular Reconstruction of Parametric Faces (sipij)
Many modern online 3D applications and videogames rely on parametric models of human faces for creating believable avatars. However, manually reproducing someone's facial likeness with a parametric model is difficult and time-consuming. A machine-learning solution to that task is highly desirable but also challenging. The paper proposes a novel approach to the so-called Face-to-Parameters problem (F2P for short), aiming to reconstruct a parametric face from a single image. The proposed method utilizes synthetic data, domain decomposition, and domain adaptation to address the multifaceted challenges of solving the F2P. The open-sourced codebase illustrates our key observations and provides means for quantitative evaluation. The presented approach proves practical in an industrial application; it improves accuracy and allows for more efficient model training. The techniques have the potential to extend to other types of parametric models.
My presentation at the 1st European Conference on Political Attitudes and Mentalities (ECPAM 2012), Bucharest, Romania, September 3-5, 2012.
Electronic paper link:
http://mass.aitia.ai/images/publikaciok/2012-ecpam-replication_case_studies-camera_ready.pdf
Abstract: This paper examines model replication in the context of agent-based simulation through two case studies. Replication of a computational model and validation of its results is an essential tool for scientific researchers, but it is rarely used by modelers. In our work we address the question of validating and verifying simulations in general, and summarize our experience in approaching different models through replication with different motivations. Two models are discussed in detail. The first one is an agent-based spatial adaptation of a numerical model, while the second experiment addresses the exact replication of an existing economic model.
A comparative review of various approaches for feature extraction in Face rec... (Vishnupriya T H)
Four feature extraction algorithms are discussed here.
1. Principal Component Analysis
2. Discrete Linear Transform
3. Independent Component Analysis
4. Linear Discriminant Analysis
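As a concrete illustration of the first algorithm on this list, here is a minimal, pure-Python sketch of PCA on 2-D data (my own example, not from the review): it builds the 2x2 covariance matrix, takes its leading eigenvector analytically, and reports the share of variance explained by the first principal component. A real pipeline would use a linear-algebra library instead.

```python
import math

def pca_2d(points):
    """First principal component of 2-D data, via the analytic
    eigendecomposition of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cx = [p[0] - mx for p in points]                  # centered x
    cy = [p[1] - my for p in points]                  # centered y
    a = sum(v * v for v in cx) / (n - 1)              # var(x)
    b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)  # cov(x, y)
    c = sum(v * v for v in cy) / (n - 1)              # var(y)
    # Eigenvalues of [[a, b], [b, c]]; the larger one is the variance
    # along the first principal component.
    disc = math.sqrt((a - c) ** 2 + 4 * b * b)
    lam1 = (a + c + disc) / 2
    lam2 = (a + c - disc) / 2
    # Eigenvector for lam1 (fall back to an axis when cov(x, y) == 0).
    if abs(b) > 1e-12:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam1 / (lam1 + lam2)

# Points on the line y = 2x: one component captures all the variance.
direction, explained = pca_2d([(1, 2), (2, 4), (3, 6), (4, 8)])
```

Dimensionality reduction then amounts to projecting each centered point onto `direction` and keeping only that coordinate.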
In this paper we deal with the impact of multi- and many-core processor architectures on simulation. Despite the fact that modern CPUs have an increasingly large number of cores, most software is still unable to take advantage of them. In recent years, many tools, programming languages and general methodologies have been proposed to help build scalable applications for multi-core architectures, but those solutions are somewhat limited. Parallel and distributed simulation is an interesting application area in which efficient and scalable multi-core implementations would be desirable. In this paper we investigate the use of the Go programming language to implement optimistic parallel simulations based on the Time Warp mechanism. Specifically, we describe the design, implementation and evaluation of a new parallel simulator. The scalability of the simulator is studied in the presence of a modern multi-core CPU, and the effects of Hyper-Threading technology on optimistic simulation are analyzed.
Science has escaped the lab and is roaming free in the world. People use software to understand the world. What tools are needed to support that work?
More Related Content
Similar to Size Doesn’t Matter? On the Value of Software Size Features for Effort Estimation
GALE: Geometric active learning for Search-Based Software Engineering (CS, NcState)
Multi-objective evolutionary algorithms (MOEAs) help software engineers find novel solutions to complex problems. When automatic tools explore too many options, they are slow to use and hard to comprehend. GALE is a near-linear time MOEA that builds a piecewise approximation to the surface of best solutions along the Pareto frontier. For each piece, GALE mutates solutions towards the better end. In numerous case studies, GALE finds comparable solutions to standard methods (NSGA-II, SPEA2) using far fewer evaluations (e.g. 20 evaluations, not 1,000). GALE is recommended when a model is expensive to evaluate, or when some audience needs to browse and understand how an MOEA has made its conclusions.
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...) (CS, NcState)
Discussions about sharing:
- Too much fear
- Not enough about benefits
Can we learn more from sharing than hoarding?
- Yes (results from SE)
Three laws of trusted data sharing:
- For SE quality prediction...
- Better models from shared, privatized data than from all raw data
Q: does this work for other kinds of data?
A: don’t know… yet
172529main ken and_tim_software_assurance_research_at_west_virginia (CS, NcState)
SA @ WV (software assurance research at West Virginia)
Kenneth McGill
NASA IV&V Facility Research Lead
304.367.8300
Kenneth.McGill@ivv.nasa.gov
Dr. Tim Menzies Ph.D. (WVU)
Software Engineering Research Chair
tim@menzies.us
Next Generation “Treatment Learning” (finding the diamonds in the dust) (CS, NcState)
Q: How have dummies (like me) managed to gain (some) control over a (seemingly) complex world?
A: The world is simpler than we think.
◆ Models contain clumps
◆ A few collar variables decide which clumps to use.
Essentials of Automations: The Art of Triggers and Actions in FME (Safe Software)
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI (Vladimir Iglovikov, Ph.D.)
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
These are the slides of the talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW) 2022.
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. The quality characteristics can be extended with sustainability, which can then be measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curricula, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the DASA Connect conference, 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. Finally, we had a lovely workshop where the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security as an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Size Doesn’t Matter? On the Value of Software Size Features for Effort Estimation
1. Size Doesn’t Matter? On the Value of Software Size Features for Effort Estimation
Ekrem Kocaguneli, Tim Menzies: WVU, USA
Jairus Hihn: JPL, USA
Byeong Ho Kang: UTAS, Aus
2. Sept 2012
Sound bites
Size matters!
But, lack of size features can be tolerated
• caveat: need to first prune irrelevancies
PROMISE’12 2
3. Role of Size Features in SEE
Size features are at the heart of some of the most widely used SEE methods.
COCOMO is based on LOC.
Function points (FP) are based on logical transactions.
Various others exist, such as number of requirements, number of modules, number of web pages, and so on…
4. Role of Size Features in SEE (cntd.)
Size features have their advantages and disadvantages.
LOC counting can be automated and is good a posteriori, but LOC is difficult to estimate early on.
FP provides a size metric based on early design information, and is hence more accurate a priori.
FP counting cannot be automated and is subjective, even though training reduces the estimate variation.
5. Objections to Size Features
Although particular size features may have their advantages in certain scenarios, there is strong opposition…
“Measuring software productivity by lines of code is like measuring progress on an airplane by how much it weighs.” — Bill Gates
“This (referring to LOC) is a very costly measuring unit because it encourages the writing of insipid code, but today I am less interested in how foolish a unit it is from even a pure business point of view.” — E. W. Dijkstra
So we ask: under what conditions are size features actually a “must”, and can we compensate for their absence?
6. So let’s check…
If we throw away size attributes, what happens?
7. If we remove “size”, what happens?
Compare standard successful methods run on reduced and full data sets, using 7 error measures and 13 data sets…
Full data sets include the size features; reduced data sets lack them.
Methods: CART, 1NN
Error measures: MAR, MMRE, MdMRE, Pred(25), MMER, MBRE, MIBRE
Datasets: Cocomo81, Cocomo81o, Cocomo81e, Cocomo81s, Nasa93, Nasa93c1, Nasa93c2, Nasa93c5, Sdr, Desharnais, DesharnaisL1, DesharnaisL2, DesharnaisL3
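For readers unfamiliar with the seven error measures above, here is a sketch under their standard definitions in the effort-estimation literature (a = actual effort, p = predicted effort; MRE = |a − p|/a, MER = |a − p|/p, BRE and IBRE divide the residual by the smaller and larger of the two values respectively). The function name is illustrative.

```python
# Sketch of the seven error measures, given lists of actual and
# predicted efforts of equal length (all actuals/predictions > 0).
from statistics import mean, median

def error_measures(actual, predicted):
    mre = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return {
        "MAR":      mean(abs(a - p) for a, p in zip(actual, predicted)),
        "MMRE":     mean(mre),
        "MdMRE":    median(mre),
        "Pred(25)": mean(1 if e <= 0.25 else 0 for e in mre),
        "MMER":     mean(abs(a - p) / p for a, p in zip(actual, predicted)),
        "MBRE":     mean(abs(a - p) / min(a, p) for a, p in zip(actual, predicted)),
        "MIBRE":    mean(abs(a - p) / max(a, p) for a, p in zip(actual, predicted)),
    }

print(error_measures([100, 200, 400], [120, 180, 400]))
```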
8. Evaluation (cntd.)
Methods: pop1NN, CART, 1NN
Error measures: MAR, MMRE, MdMRE, Pred(25), MMER, MBRE, MIBRE
Datasets: Cocomo81, Cocomo81o, Cocomo81e, Cocomo81s, Nasa93, Nasa93c1, Nasa93c2, Nasa93c5, Sdr, Desharnais, DesharnaisL1, DesharnaisL2, DesharnaisL3
Compare pop1NN against CART & 1NN on multiple data sets collected via COCOMO, COCOMO II and FP, using 7 error measures and the Mann-Whitney test at 95%.
Why CART? Dejaeger et al., TSE 2012
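The win/loss counts in the following slides rest on Mann-Whitney at the 95% level. A minimal pure-Python sketch (normal approximation with midranks for ties and no tie correction; in practice `scipy.stats.mannwhitneyu` is the proper tool, and the function names here are illustrative):

```python
# Mann-Whitney U with a normal approximation: small |z| means the two
# samples of errors are indistinguishable; |z| > 1.96 flags a difference
# at the 95% level.
import math

def mann_whitney_z(xs, ys):
    """Return (U1, z) for samples xs and ys."""
    pooled = sorted((v, idx) for idx, v in enumerate(xs + ys))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):                      # assign midranks to ties
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[pooled[k][1]] = (i + j + 1) / 2.0
        i = j
    n1, n2 = len(xs), len(ys)
    r1 = sum(ranks[:n1])                        # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return u1, (u1 - mu) / sigma

def different_at_95(xs, ys):
    return abs(mann_whitney_z(xs, ys)[1]) > 1.96

print(different_at_95([1, 2, 3, 4, 5, 6, 7, 8],
                      [11, 12, 13, 14, 15, 16, 17, 18]))  # clearly separated
```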
9. Results (full data has “size”, reduced does not)
CART on reduced data sets vs. CART on full data sets.
The last column shows the total loss count of CART run on the reduced data sets (i.e. no size features).
In 7 of 13 tests, taking out size makes CART perform worse.
10. Results (full data has “size”, reduced does not)
Total loss counts of CART and 1NN run on reduced data vs. their variants run on full data…
Standard methods are better off with the size attributes of the data sets; i.e. they cannot compensate for the lack of size attributes well.
(copied from last slide)
11. New idea
If we prune data irrelevancies, can we survive losing size attributes?
12. Instance selection
• Chang (1974)
– Most of the instances are uninformative.
– Reduced data sets of size 514, 150, 66 to 34, 14, 6 prototypes.
• Li et al. (2009)
– Genetic algorithm for instance selection.
• Turhan et al. (2009)
– Instance selection as a filter for cross-company defect data.
– See also Kocaguneli et al. (2011).
• Kocaguneli et al. (2011): variance-based selection
– Dendrogram of clusters: prune sub-trees with large variances.
• Keung et al.’s (2011) Analogy-X
– Instance selection method for analogy-based estimation.
• New idea, pop1NN: a very simple instance selector
13. pop1NN: the urchin shape
We propose that a “popularity”-based method can compensate for the lack of size features.
The “popularity” of an instance is the number of times it is the nearest neighbor of other instances.
A sea urchin is a good analogy for SEE data: popular central instances that are the closest neighbors of scattered neighbors…
14. Formally, this is rNN
• rNN = Reverse Nearest Neighbor
– E.g. how many residential areas would find a new store as their nearest choice.
– E.g. to predict the popularity of a new cell phone plan, determine how many profiles have the plan as their best match against the existing plans in the market.
• Can be computed efficiently (rNN chaining)
– See Lopez-Sastre et al., “Fast Reciprocal Nearest Neighbors Clustering”, Signal Processing, 2012, Vol. 92, pp. 270–275.
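The store/phone-plan examples above can be sketched directly as a brute-force reverse-nearest-neighbor count (illustrative only; the paper points to rNN chaining for an efficient version):

```python
# For each instance, count how many other instances pick it as their
# single nearest neighbor (Euclidean distance, brute force for clarity).
import math

def popularity(points):
    n = len(points)
    pop = [0] * n
    for i in range(n):
        nearest, best = None, math.inf
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if d < best:
                nearest, best = j, d
        pop[nearest] += 1
    return pop

# A central point surrounded by scattered ones: the "sea urchin" shape.
pts = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
print(popularity(pts))  # → [4, 1, 0, 0, 0]: the centre is everyone's nearest
```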
15. So let’s check…
If we (1) throw away size attributes and (2) prune irrelevant rows, then what happens?
16. Details: pop1NN (cntd.)
pop1NN is a 6-step procedure:
1. Calculate distances between every pair of training instances
2. Convert the distances of Step 1 into an ordering of neighbors
3. Mark closest neighbors and calculate popularity
4. Order training instances by decreasing popularity
5. Decide which instances to select
• Experiments with nearest neighbor on a hold-out set
6. Return estimates for the test instances
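The six steps above can be sketched end-to-end. Note the hedges: Step 5 in the paper picks the cutoff via a hold-out experiment, whereas this sketch assumes a fixed `k` for brevity, and all names are illustrative rather than from the paper's code.

```python
# pop1NN sketch: prune to the most "popular" training instances, then
# answer each test query with plain 1NN over the surviving prototypes.
import math

def pop1nn(train_x, train_y, test_x, k):
    n = len(train_x)
    # Steps 1-3: pairwise distances, neighbor ordering, and popularity
    # (popularity = how often an instance is someone's nearest neighbor)
    pop = [0] * n
    for i in range(n):
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: math.dist(train_x[i], train_x[j]))
        pop[nearest] += 1
    # Step 4: order training instances by decreasing popularity
    order = sorted(range(n), key=lambda i: -pop[i])
    # Step 5: keep the k most popular instances (fixed k; see lead-in)
    keep = order[:k]
    # Step 6: 1NN estimate for each test instance over the kept prototypes
    estimates = []
    for x in test_x:
        j = min(keep, key=lambda i: math.dist(x, train_x[i]))
        estimates.append(train_y[j])
    return estimates

# Two tight clusters; the cluster centres are the popular prototypes.
train_x = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_y = [10, 11, 12, 50, 51, 52]
print(pop1nn(train_x, train_y, [(0.2, 0.1), (5.4, 5.2)], k=2))  # → [10, 50]
```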
17. Results (reduced data)
Loss values of pop1NN (on reduced data) vs. CART and 1NN (on full data):
pop1NN loses 2 out of 13 data sets against 1NN
pop1NN loses 4 out of 13 data sets against CART
19. Conclusions
Successful methods (1NN & CART) cannot compensate for the lack of size attributes very well:
lack of size features decreases their performance in the majority of the data sets.
When 1NN is augmented with a popularity-based pre-processor to become pop1NN:
the lack of size features can be tolerated in most of the data sets
(caveat: need to first prune irrelevancies).
Size features are essential for standard learners.
Practitioners with enough resources to correctly collect size features should do so.
Lacking such resources, pop1NN-like methods can compensate for the absence of size features.
20. Future Work
• pop1NN as a feature selector?
– Lipowezky (1998): feature and case selection are similar tasks; both remove cells in the hypercube of all instances times all features.
– So it should be possible to convert a case selection mechanism into a feature selector:
• Transpose the data
• Nearby columns are correlated
• Keep columns that are near no other
• Active learning:
– pop1NN does not use dependent-variable information.
– It can identify the popular instances of a data set and guide expert reflection on collecting dependent-variable information.
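The transpose trick in the first bullet can be sketched speculatively. This is not from the paper: it reads "keep columns that are near no other" literally as keeping zero-popularity columns, and all names are illustrative.

```python
# Speculative feature selector: transpose so features become rows, compute
# column popularity (rNN counts), and keep only columns that are no other
# column's nearest neighbour (popular columns have correlated neighbours).
import math

def select_features(rows):
    """Return indices of columns that are no other column's nearest neighbour."""
    cols = list(zip(*rows))                  # transpose: features become rows
    n = len(cols)
    pop = [0] * n
    for i in range(n):
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: math.dist(cols[i], cols[j]))
        pop[nearest] += 1
    return [i for i in range(n) if pop[i] == 0]

# Columns 0 and 1 are nearly identical (correlated); column 2 stands alone.
rows = [(1, 1, 100), (2, 2, 200), (3, 3.1, 300)]
print(select_features(rows))  # → [2]
```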