SlideShare a Scribd company logo
1 of 24
Download to read offline
ÉCOLE POLYTECHNIQUE DE MONTRÉAL



  Concept Location with Genetic
Algorithms: A Comparison of Four
    Distributed Architectures


             Fatemeh Asadi
            Giuliano Antoniol
          Yann-Gaël Guéhéneuc
Content
 Motivations
 Literature Review
 Background
    Getting Traces, Trace Pruning and Compression
    Textual Analysis of Method Source Code
    Applying Search-based Optimization Technique
 Parallelizing GAs
    Rationale of the Architectures
    Simple Client Server Architecture
    Database Configuration
    Hash-database Configuration
    Hash Configuration
 Summery of Results
 Conclusion and Future Work

                                                    2
Motivations

 Genetic algorithms (GAs) are attractive to solve many search-based
 software engineering problems
 GAs are effective in finding approximate solutions when
   The search space is large or complex
   No mathematical analysis or traditional methods are available
   The problem to be solved is NP-complete or NP-hard


 GAs allow the parallelization of computations
   The evaluation of the fitness function is often performed on each
   individual in isolation
   Parallelizing the computations improves scalability and reduces
   computation time
                                                                       3
Motivations


 In this paper, we report our experience in distributing the
 computation of a fitness function to parallelize a GA to
 solve the concept location problem.

 We developed, tested, and compared four different
 configuration of distributed architectures to resolve the
 scalability issues of our concept location approach
     A client (master) performs GA operations and distributes the
    individuals among servers
    Servers (slaves) perform the computations of the fitness
    function
                                                                    4
Literature Review


 Mitchal et al. used a distributed hill-climbing to re-modularize large
 systems by grouping together related components by means of
 clustering techniques
 Mahdavi et al. used a distributed hill-climbing for module clustering


 Parallel GAs are classified into three categories (Stender et al.)
    Global parallelization: only the evaluation of the individuals’ fitness is
    parallelized (our approach)
    Coarse-grained parallelization (island model): a computer divides a
    population into subpopulations and assigns each to another computer.
    A GA is executed on each sub-population
    Fine-grained parallelization: each individual is assigned to a computer
    and all the GA operations are performed in parallel
                                                                                 5
Background

 Our proposed concept location approach consists of the
 following steps:
   Step I – System instrumentation
   Step II – Execution trace collection
   Step III – Trace pruning and compression
   Step IV – Textual analysis of method source code
   Step V – Search-based concept location




                                                          6
Step I and Step II – Getting the Traces


 Step I – System instrumentation
    It is performed to generate the trace of method invocations
      We used our instrumentor, MoDeC



 Step II – Execution trace collection
    We exercise the system following scenarios taken from user
    manuals or use-case descriptions



                                                                  7
Step III – Trace Pruning and Compression
  Trace pruning
     Too frequent methods occurring in many scenarios are not related to any
     particular concept (they are often utility methods)
        We remove methods having a frequency
        Q3 + 2 × IQR (75% percentile + 2 × the inter-quartile range)


  Trace compression
     We “collapse” repetitions in execution traces to reduce the search space
     Examples:
        m1(); m1(); m1();                =>          m1();
        m1(); m2(); m1(); m2();          =>          m1(); m2();
     We perform the collapsing using the Run Length Encoding (RLE)
     We apply the RLE for sub-sequences having an arbitrary length

                                                                                8
Step IV- Textual Analysis of Method Source Code

   The data provided in this part are used to calculate conceptual
   cohesion and coupling as described by Marcus et al. in 2008 and
   Poshyvanyk et al. in 2006.


   Our assumptions are
    1) Methods helping to implement a concept are likely to share
       some linguistic information (see Poshyvanyk et al., 2006)

    2) Methods responsible to implement a concept are likely to be
       called close to each other in an execution trace.

                                                                     9
Step IV – Textual Analysis of Method Source Code
Steps:
   Extraction of identifiers and comment words
   Camel-case splitting of composed identifiers: visitedNode -> visited & node
   Stop word removal (English + Java keywords)
   Stemming using the Porter stemmer: Visited -> visit
   Term-document matrix generation
     Considering each method is as a document
     Indexing using tf–idf
   Reducing the term-document space into a concept-document space using
   Latent Semantic Indexing (LSI)
     To cope with synonymy and polysemy
                                                                            10
Step V – Search-based Concept Location
    We use a search-based optimization technique based on a GA to split traces into
    segments that, we believe, represent concepts
    Representation: a bit-vector where 1 indicates the end of a segment

                  Trace splitting       m1 m2 m1 m3 m4 m1 m4 m6 m1

                  Representation        0    1   0   0   1   0       0       0       1

    Mutation: randomly flips a bit (i.e., splits or merge segments)
0    1   0    0    1    0   0       0    1                       0       0
                                                                         1       0       0   1   0   0   0    1

    Crossover: two-points
0    1   0    0    1    0   0       0    1                       0       1       0       0   0   1   0   0    1

0    0   1    0    0    1   0       0    1                       0       0       1       0   1   0   0   0    1

                                                                                                             11
    Selection: roulette wheel
Step V – Search-based Concept Location
  Fitness Function:



  Segment Cohesion is the average (textual) similarity between any pair
  of methods in a segment
  Segment Coupling is the average (textual) similarity between a segment
  and all other segments in the trace
  Other GA parameters
     200 individuals
     2,000 generations for JHotDraw and 3,000 for ArgoUML
     5% mutation probability, 70% crossover probability
                                                                     12
Parallelizing the GA
 Single-computer architecture come from an optimized implementation of our
 approach.
     We modified GALib to compute only the fitness values of individuals that
     have changed between the last generation and the current one.
     We thus obtained about 30% of computation-time decrease

 The computations were still time consuming
     Running an experiment with a short trace, containing 240 method invocations
     (Start–DrawRectangle–Stop), with 2,000 iterations lasted about 12 hours

 Solution: Parallelizing the computations

 Global parallelization:
     A master computer (client)
     Several slave computers (servers)

 Client server architectural style, on a TCP/IP network, with 9 servers
                                                                                   13
Rationale behind the Architectures

 Several individuals share some segments
    The first two segments of individuals Id1, Id2, and Id5 are
    identical, as shown below




 Once the fitness value of one individual is computed, if
 segment cohesion and coupling are stored, then they can
 be reused to compute the fitness values of others                14
Simple Client Server Architecture
that the
computations are have
 The servers very time consuming.
    No local memory
    No global shared memory
    No communication between themselves
 Each server uses its own local LSI matrix

 In the case of Start-DrawRectangle-
 stop trace, this architecture reduces
 the execution time to about two hours



                                             15
Database Configuration
 A large fragment of segments are preserved unchanged in the next
 generation

A central storage keeps a history of
the previous computations and share
the computation results
No calculation is done more than
once
Each computer benefits from others’
computations and does not repeat
what has been done once by others




                                                                    16
Database Configuration

 In the case of Start-
 DrawRectangle-stop trace, this
 architecture increases the
 computation time to 13 hours!


 Accessing to a database located on a
 different server is time consuming
 Database access issues
    Reading, writing, locking




                                        17
Hash-Database Configuration
A centralized and local storage are put together to decrease the numbers of
accesses to the central database by keeping cached data in a local storage
structure
Not much better than the last one




                                                                              18
Hash Configuration

A local storage
    Only local data is stored in the local
    hash table of servers.
    No data is shared among servers
    Servers only communicate with the
    client
    There is no access policy using locking
    algorithms
    The access to the already-computed
    data as well as their storage is efficient
    It enormously reduces the time.
    In the case of Start-DrawRectangle-
    stop trace, this architecture reduces
    the execution time to about 5
    minutes                                      19
Summery of Results

Computation time for desktop solution
and the distributed architectures with
the Start-DrawRectangle-stop trace



                                         Computation time for desktop solution
                                         and a hash-table distributed architecture
                                         with the Start-Spawn-Window-Draw-
                                         Circle-Stop”trace




                                                                                     20
Conclusion
 We presented and discussed four client–server architectures conceived
 to improve performance and reduce GA computation times to solve
 the concept location problem


 We discovered that on a standard TCP/IP network, the overhead of
 database accesses, communication, and latency may impair dedicated
 solutions


 In our experiments, the fastest solution was an architecture where each
 server kept track only of its computations without exchanging data
 with other servers
     This simple architecture reduced GA computation by about 140 times
     when compared to a simple implementation, in which all GA
                                                                           21
     operations are performed on a single machine
Future Work


 We intend to experiment different communication
 protocols (e.g., UDP) and synchronization strategies

 We will carry out other empirical studies to evaluate the
 approach on more traces, obtained from different systems,
 to verify the generality of our findings

 We will reformulate other search-based software
 engineering problems to exploit parallel computation to
 verify further our findings
                                                             22
Thank You!




             Questions?
                          23
Implementation and Environment
Two traces collected by instrumenting JHotDraw and executing the scenarios Start-
DrawRectangle-Quit and Start-Spawn-Window-Draw-Circle-Stop
Final length of the traces after removing utility methods and compression are
respectively 240 and 432 method invocations
Computations are distributed over a sub-network of 14 workstations
 Five high-end workstations, the most powerful ones, are connected in a Gigabit
Ethernet LAN
 Low-end workstations are connected to a LAN segment at 100 MBit/s and talk
among themselves at 100 Mbit/s
 Each experience was run on a subset of ten computers: nine servers and one client
Workstations run CentOS v5 64 bits
Memory varies between four to 16 Gbytes
Workstations are based on Athlon X2 Dual Core Processor 4400
The five high-end workstations are either single or dual Opteron
Workstations run basic Unix services and user processes
The client computer was also responsible to measure execution times and to verify the
liveness of connections
Connections to servers as well as connections to the database were implemented on
                                                                                        24
top of TCP/IP (AF INET) sockets
All components have been implemented in Java 1.5 64bits
The database server was MySQL server v5.0.77

More Related Content

What's hot

Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...
VLSICS Design
 
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...
IJCNCJournal
 
Seminar_New -CESG
Seminar_New -CESGSeminar_New -CESG
Seminar_New -CESG
Qian Wang
 
Producer consumer-problems
Producer consumer-problemsProducer consumer-problems
Producer consumer-problems
Richard Ashworth
 

What's hot (18)

B0330811
B0330811B0330811
B0330811
 
A0360109
A0360109A0360109
A0360109
 
On The Application of Hyperbolic Activation Function in Computing the Acceler...
On The Application of Hyperbolic Activation Function in Computing the Acceler...On The Application of Hyperbolic Activation Function in Computing the Acceler...
On The Application of Hyperbolic Activation Function in Computing the Acceler...
 
Hardware Implementation of Cascade SVM
Hardware Implementation of Cascade SVMHardware Implementation of Cascade SVM
Hardware Implementation of Cascade SVM
 
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...
 
A NEW ALGORITHM FOR DATA HIDING USING OPAP AND MULTIPLE KEYS
A NEW ALGORITHM FOR DATA HIDING USING OPAP AND MULTIPLE KEYSA NEW ALGORITHM FOR DATA HIDING USING OPAP AND MULTIPLE KEYS
A NEW ALGORITHM FOR DATA HIDING USING OPAP AND MULTIPLE KEYS
 
DSR Routing Decisions for Mobile Ad Hoc Networks using Fuzzy Inference System
DSR Routing Decisions for Mobile Ad Hoc Networks using Fuzzy Inference SystemDSR Routing Decisions for Mobile Ad Hoc Networks using Fuzzy Inference System
DSR Routing Decisions for Mobile Ad Hoc Networks using Fuzzy Inference System
 
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...
 
4213ijaia02
4213ijaia024213ijaia02
4213ijaia02
 
Seminar_New -CESG
Seminar_New -CESGSeminar_New -CESG
Seminar_New -CESG
 
Producer consumer-problems
Producer consumer-problemsProducer consumer-problems
Producer consumer-problems
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda c
 
Paper on experimental setup for verifying - "Slow Learners are Fast"
Paper  on experimental setup for verifying  - "Slow Learners are Fast"Paper  on experimental setup for verifying  - "Slow Learners are Fast"
Paper on experimental setup for verifying - "Slow Learners are Fast"
 
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONMAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
 
A046020112
A046020112A046020112
A046020112
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CL
 
MapReduce based SVM
MapReduce based SVMMapReduce based SVM
MapReduce based SVM
 
Communication costs in parallel machines
Communication costs in parallel machinesCommunication costs in parallel machines
Communication costs in parallel machines
 

Viewers also liked (7)

Increase pinterest follower
Increase pinterest followerIncrease pinterest follower
Increase pinterest follower
 
ICSE12 SEE.ppt
ICSE12 SEE.pptICSE12 SEE.ppt
ICSE12 SEE.ppt
 
Soundcloud badge
Soundcloud badgeSoundcloud badge
Soundcloud badge
 
WBT07.ppt
WBT07.pptWBT07.ppt
WBT07.ppt
 
130531 francis nahm - on the evolution of antipatterns genealogies
130531   francis nahm - on the evolution of antipatterns genealogies130531   francis nahm - on the evolution of antipatterns genealogies
130531 francis nahm - on the evolution of antipatterns genealogies
 
How to use Evernote
How to use EvernoteHow to use Evernote
How to use Evernote
 
AsianPLoP'14: How and Why Design Patterns Impact Quality and Future Challenges
AsianPLoP'14: How and Why Design Patterns Impact Quality and Future ChallengesAsianPLoP'14: How and Why Design Patterns Impact Quality and Future Challenges
AsianPLoP'14: How and Why Design Patterns Impact Quality and Future Challenges
 

Similar to SSBSE10.ppt

Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Mumbai Academisc
 
A tale of bug prediction in software development
A tale of bug prediction in software developmentA tale of bug prediction in software development
A tale of bug prediction in software development
Martin Pinzger
 
Jovian DATA: A multidimensional database for the cloud
Jovian DATA: A multidimensional database for the cloudJovian DATA: A multidimensional database for the cloud
Jovian DATA: A multidimensional database for the cloud
Bharat Rane
 

Similar to SSBSE10.ppt (20)

Ssbse10.ppt
Ssbse10.pptSsbse10.ppt
Ssbse10.ppt
 
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
 
CSMR10a.ppt
CSMR10a.pptCSMR10a.ppt
CSMR10a.ppt
 
A tale of bug prediction in software development
A tale of bug prediction in software developmentA tale of bug prediction in software development
A tale of bug prediction in software development
 
SoC-2012-pres-2
SoC-2012-pres-2SoC-2012-pres-2
SoC-2012-pres-2
 
Hadoop bank
Hadoop bankHadoop bank
Hadoop bank
 
Scimakelatex.93126.cocoon.bobbin
Scimakelatex.93126.cocoon.bobbinScimakelatex.93126.cocoon.bobbin
Scimakelatex.93126.cocoon.bobbin
 
Csmr10a.ppt
Csmr10a.pptCsmr10a.ppt
Csmr10a.ppt
 
NoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, ImplementationsNoSQL Introduction, Theory, Implementations
NoSQL Introduction, Theory, Implementations
 
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageI-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
 
Noha mega store
Noha mega storeNoha mega store
Noha mega store
 
An Adaptive Load Balancing Middleware for Distributed Simulation
An Adaptive Load Balancing Middleware for Distributed SimulationAn Adaptive Load Balancing Middleware for Distributed Simulation
An Adaptive Load Balancing Middleware for Distributed Simulation
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
 
Crypto
CryptoCrypto
Crypto
 
Jovian DATA: A multidimensional database for the cloud
Jovian DATA: A multidimensional database for the cloudJovian DATA: A multidimensional database for the cloud
Jovian DATA: A multidimensional database for the cloud
 
AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
GRID COMPUTING
GRID COMPUTINGGRID COMPUTING
GRID COMPUTING
 
Model checking
Model checkingModel checking
Model checking
 
Mastering AIOps with Deep Learning
Mastering AIOps with Deep LearningMastering AIOps with Deep Learning
Mastering AIOps with Deep Learning
 

More from Ptidej Team

More from Ptidej Team (20)

From IoT to Software Miniaturisation
From IoT to Software MiniaturisationFrom IoT to Software Miniaturisation
From IoT to Software Miniaturisation
 
Presentation
PresentationPresentation
Presentation
 
Presentation
PresentationPresentation
Presentation
 
Presentation
PresentationPresentation
Presentation
 
Presentation by Lionel Briand
Presentation by Lionel BriandPresentation by Lionel Briand
Presentation by Lionel Briand
 
Manel Abdellatif
Manel AbdellatifManel Abdellatif
Manel Abdellatif
 
Azadeh Kermansaravi
Azadeh KermansaraviAzadeh Kermansaravi
Azadeh Kermansaravi
 
Mouna Abidi
Mouna AbidiMouna Abidi
Mouna Abidi
 
CSED - Manel Grichi
CSED - Manel GrichiCSED - Manel Grichi
CSED - Manel Grichi
 
Cristiano Politowski
Cristiano PolitowskiCristiano Politowski
Cristiano Politowski
 
Will io t trigger the next software crisis
Will io t trigger the next software crisisWill io t trigger the next software crisis
Will io t trigger the next software crisis
 
MIPA
MIPAMIPA
MIPA
 
Thesis+of+laleh+eshkevari.ppt
Thesis+of+laleh+eshkevari.pptThesis+of+laleh+eshkevari.ppt
Thesis+of+laleh+eshkevari.ppt
 
Thesis+of+nesrine+abdelkafi.ppt
Thesis+of+nesrine+abdelkafi.pptThesis+of+nesrine+abdelkafi.ppt
Thesis+of+nesrine+abdelkafi.ppt
 
Medicine15.ppt
Medicine15.pptMedicine15.ppt
Medicine15.ppt
 
Qrs17b.ppt
Qrs17b.pptQrs17b.ppt
Qrs17b.ppt
 
Icpc11c.ppt
Icpc11c.pptIcpc11c.ppt
Icpc11c.ppt
 
Icsme16.ppt
Icsme16.pptIcsme16.ppt
Icsme16.ppt
 
Msr17a.ppt
Msr17a.pptMsr17a.ppt
Msr17a.ppt
 
Icsoc15.ppt
Icsoc15.pptIcsoc15.ppt
Icsoc15.ppt
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 

SSBSE10.ppt

  • 1. ÉCOLE POLYTECHNIQUE DE MONTRÉAL Concept Location with Genetic Algorithms: A Comparison of Four Distributed Architectures Fatemeh Asadi Giuliano Antoniol Yann-Gaël Guéhéneuc
  • 2. Content Motivations Literature Review Background Getting Traces, Trace Pruning and Compression Textual Analysis of Method Source Code Applying Search-based Optimization Technique Parallelizing GAs Rationale of the Architectures Simple Client Server Architecture Database Configuration Hash-database Configuration Hash Configuration Summery of Results Conclusion and Future Work 2
  • 3. Motivations Genetic algorithms (GAs) are attractive to solve many search-based software engineering problems GAs are effective in finding approximate solutions when The search space is large or complex No mathematical analysis or traditional methods are available The problem to be solved is NP-complete or NP-hard GAs allow the parallelization of computations The evaluation of the fitness function is often performed on each individual in isolation Parallelizing the computations improves scalability and reduces computation time 3
  • 4. Motivations In this paper, we report our experience in distributing the computation of a fitness function to parallelize a GA to solve the concept location problem. We developed, tested, and compared four different configuration of distributed architectures to resolve the scalability issues of our concept location approach A client (master) performs GA operations and distributes the individuals among servers Servers (slaves) perform the computations of the fitness function 4
  • 5. Literature Review Mitchal et al. used a distributed hill-climbing to re-modularize large systems by grouping together related components by means of clustering techniques Mahdavi et al. used a distributed hill-climbing for module clustering Parallel GAs are classified into three categories (Stender et al.) Global parallelization: only the evaluation of the individuals’ fitness is parallelized (our approach) Coarse-grained parallelization (island model): a computer divides a population into subpopulations and assigns each to another computer. A GA is executed on each sub-population Fine-grained parallelization: each individual is assigned to a computer and all the GA operations are performed in parallel 5
  • 6. Background Our proposed concept location approach consists of the following steps: Step I – System instrumentation Step II – Execution trace collection Step III – Trace pruning and compression Step IV – Textual analysis of method source code Step V – Search-based concept location 6
  • 7. Step I and Step II – Getting the Traces Step I – System instrumentation It is performed to generate the trace of method invocations We used our instrumentor, MoDeC Step II – Execution trace collection We exercise the system following scenarios taken from user manuals or use-case descriptions 7
  • 8. Step III – Trace Pruning and Compression Trace pruning Too frequent methods occurring in many scenarios are not related to any particular concept (they are often utility methods) We remove methods having a frequency Q3 + 2 × IQR (75% percentile + 2 × the inter-quartile range) Trace compression We “collapse” repetitions in execution traces to reduce the search space Examples: m1(); m1(); m1(); => m1(); m1(); m2(); m1(); m2(); => m1(); m2(); We perform the collapsing using the Run Length Encoding (RLE) We apply the RLE for sub-sequences having an arbitrary length 8
  • 9. Step IV- Textual Analysis of Method Source Code The data provided in this part are used to calculate conceptual cohesion and coupling as described by Marcus et al. in 2008 and Poshyvanyk et al. in 2006. Our assumptions are 1) Methods helping to implement a concept are likely to share some linguistic information (see Poshyvanyk et al., 2006) 2) Methods responsible to implement a concept are likely to be called close to each other in an execution trace. 9
  • 10. Step IV – Textual Analysis of Method Source Code Steps: Extraction of identifiers and comment words Camel-case splitting of composed identifiers: visitedNode -> visited & node Stop word removal (English + Java keywords) Stemming using the Porter stemmer: Visited -> visit Term-document matrix generation Considering each method is as a document Indexing using tf–idf Reducing the term-document space into a concept-document space using Latent Semantic Indexing (LSI) To cope with synonymy and polysemy 10
  • 11. Step V – Search-based Concept Location We use a search-based optimization technique based on a GA to split traces into segments that, we believe, represent concepts Representation: a bit-vector where 1 indicates the end of a segment Trace splitting m1 m2 m1 m3 m4 m1 m4 m6 m1 Representation 0 1 0 0 1 0 0 0 1 Mutation: randomly flips a bit (i.e., splits or merge segments) 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 Crossover: two-points 0 1 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 11 Selection: roulette wheel
  • 12. Step V – Search-based Concept Location Fitness Function: Segment Cohesion is the average (textual) similarity between any pair of methods in a segment Segment Coupling is the average (textual) similarity between a segment and all other segments in the trace Other GA parameters 200 individuals 2,000 generations for JHotDraw and 3,000 for ArgoUML 5% mutation probability, 70% crossover probability 12
  • 13. Parallelizing the GA Single-computer architecture come from an optimized implementation of our approach. We modified GALib to compute only the fitness values of individuals that have changed between the last generation and the current one. We thus obtained about 30% of computation-time decrease The computations were still time consuming Running an experiment with a short trace, containing 240 method invocations (Start–DrawRectangle–Stop), with 2,000 iterations lasted about 12 hours Solution: Parallelizing the computations Global parallelization: A master computer (client) Several slave computers (servers) Client server architectural style, on a TCP/IP network, with 9 servers 13
  • 14. Rationale behind the Architectures Several individuals share some segments The first two segments of individuals Id1, Id2, and Id5 are identical, as shown below Once the fitness value of one individual is computed, if segment cohesion and coupling are stored, then they can be reused to compute the fitness values of others 14
  • 15. Simple Client Server Architecture that the computations are have The servers very time consuming. No local memory No global shared memory No communication between themselves Each server uses its own local LSI matrix In the case of Start-DrawRectangle- stop trace, this architecture reduces the execution time to about two hours 15
  • 16. Database Configuration A large fragment of segments are preserved unchanged in the next generation A central storage keeps a history of the previous computations and share the computation results No calculation is done more than once Each computer benefits from others’ computations and does not repeat what has been done once by others 16
  • 17. Database Configuration In the case of Start- DrawRectangle-stop trace, this architecture increases the computation time to 13 hours! Accessing to a database located on a different server is time consuming Database access issues Reading, writing, locking 17
  • 18. Hash-Database Configuration A centralized and local storage are put together to decrease the numbers of accesses to the central database by keeping cached data in a local storage structure Not much better than the last one 18
  • 19. Hash Configuration A local storage Only local data is stored in the local hash table of servers. No data is shared among servers Servers only communicate with the client There is no access policy using locking algorithms The access to the already-computed data as well as their storage is efficient It enormously reduces the time. In the case of Start-DrawRectangle- stop trace, this architecture reduces the execution time to about 5 minutes 19
  • 20. Summery of Results Computation time for desktop solution and the distributed architectures with the Start-DrawRectangle-stop trace Computation time for desktop solution and a hash-table distributed architecture with the Start-Spawn-Window-Draw- Circle-Stop”trace 20
  • 21. Conclusion We presented and discussed four client–server architectures conceived to improve performance and reduce GA computation times to solve the concept location problem We discovered that on a standard TCP/IP network, the overhead of database accesses, communication, and latency may impair dedicated solutions In our experiments, the fastest solution was an architecture where each server kept track only of its computations without exchanging data with other servers This simple architecture reduced GA computation by about 140 times when compared to a simple implementation, in which all GA 21 operations are performed on a single machine
  • 22. Future Work We intend to experiment different communication protocols (e.g., UDP) and synchronization strategies We will carry out other empirical studies to evaluate the approach on more traces, obtained from different systems, to verify the generality of our findings We will reformulate other search-based software engineering problems to exploit parallel computation to verify further our findings 22
  • 23. Thank You! Questions? 23
  • 24. Implementation and Environment Two traces collected by instrumenting JHotDraw and executing the scenarios Start- DrawRectangle-Quit and Start-Spawn-Window-Draw-Circle-Stop Final length of the traces after removing utility methods and compression are respectively 240 and 432 method invocations Computations are distributed over a sub-network of 14 workstations Five high-end workstations, the most powerful ones, are connected in a Gigabit Ethernet LAN Low-end workstations are connected to a LAN segment at 100 MBit/s and talk among themselves at 100 Mbit/s Each experience was run on a subset of ten computers: nine servers and one client Workstations run CentOS v5 64 bits Memory varies between four to 16 Gbytes Workstations are based on Athlon X2 Dual Core Processor 4400 The five high-end workstations are either single or dual Opteron Workstations run basic Unix services and user processes The client computer was also responsible to measure execution times and to verify the liveness of connections Connections to servers as well as connections to the database were implemented on 24 top of TCP/IP (AF INET) sockets All components have been implemented in Java 1.5 64bits The database server was MySQL server v5.0.77