Scaling Classical Clone Detection Tools for Ultra-Large Datasets

Jeffrey Svajlenko, Iman Keivanloo, Chanchal Roy
IWSC 2013

Inter-Project Clone Detection
•  Active research topic in the community.
•  Goal: Construct inter-project clone corpus.
•  Applications
   •  Study Global Developer Behavior
   •  Discover Potential APIs and Libraries
   •  Internet-Scale Clone Search
   •  API Recommendation
   •  API Usage Support
   •  …

Problem: Inter-Project Detection
•  Many state-of-the-art tools (classical tools) do not scale to large datasets.
   •  Memory Requirements
   •  Computational Complexity
   •  Execution Time
   •  Underlying limitations in their algorithms or data structures.
•  Instead, novel scalable techniques are used.
   •  Challenging to develop.
•  We wish to use tools from a variety of domains when building an inter-project clone corpus.

Goal and Motivation
GOAL
To scale classical clone detection tools to ultra-large datasets.

MOTIVATION
To allow classical clone detection tools to contribute to inter-project clone corpora.

Shuffling Framework
•  Scales classical tools to ultra-large datasets.
   •  Using standard hardware.
   •  Without modifying the original tool.
   •  Incurs a loss of recall.

•  Method: Non-Deterministic Dataset Partitioning

Shuffling Framework - Procedure
1.  The source files of the dataset are randomly partitioned into n equally sized subsets.
[Figure: an ultra-large dataset divided into 16 numbered subsets.]
Subset size is dictated by the clone detection tool's scalability limits.

Shuffling Framework - Procedure
2.  Each subset is searched independently by the clone detection tool.
[Figure: subsets 1 through 16, each passed to its own run of the clone detection tool.]

Shuffling Framework - Procedure
3.  The detected clone pairs are added to a clone repository.
[Figure: the clones detected in subsets 1 through 16 are collected into a single set of detected clones.]

Shuffling Framework - Procedure
4.  Steps (1) through (3) are repeated for r rounds.
[Figure: the dataset feeds the clone repository over r rounds.]
In total: n*r detection experiments.

Shuffling Framework - Evaluation
Gold Standard
•  Clone detection report of a tool executed natively (without shuffling).

Total Recall
•  % of gold standard found after r shuffling rounds of n partitions.
•  Measured for unique clone pairs or unique cloned fragments.

Preliminary Study
•  Test with "regular size" systems:
   •  JHotDraw (20 KLOC, 285 files)
   •  ArgoUML (190 KLOC, 1845 files)
   •  JDK1.7 (900 KLOC, 6916 files)

•  Tools:
   •  CCFinder, Deckard, iClones, NiCad, SimCad, Simian

•  Shuffling: 15 subsets, 30 shuffling rounds

•  Measured: total recall after each round

Preliminary Study – JDK1.7
[Figure: recall (0 to 1) versus round (1 to 30) for n = 15 subsets, r = 30 rounds. Series: Deckard (1834042), iClones (49716), NiCad (8105), SimCad (549923), Simian (217409).]

Preliminary Study
•  ~60-90% total recall achievable
•  Shuffling performance varies by detection tool.
•  Generally, a larger gold standard requires more rounds to get the same total recall.

Main Experiment: Dataset
IJaDataset 2.0: An Inter-Project Java Corpus
•  Keivanloo et al., 2012 (Proc. MSR)
•  Crawled 25,000 Open-Source Java Projects
•  3 million Java source files, 356 MLOC
•  Outliers (>2000 lines)
   •  6238 removed

Experiment - Hardware
Clone detection (shuffling):
•  Workstation-Class Hardware
   •  Quad Core CPU
   •  12-16GB of RAM
   •  Above Average Disk IO
   •  ~$1000 PC

•  Allocated on shared cloud resources.
   •  Western Canada Research Grid (Bugaboo Cluster)
   •  Amazon EC2 Instances

Experiment - Tools
•  Simian
•  NiCad
•  Deckard
•  CCFinderX
   •  Terminated without explanation.
•  SimCad
   •  Execution aborts on troublesome file.
•  iClones
   •  Compatibility issue.

Simian
•  IJaDataset2
•  Scalability limit: RAM
•  50,000 file subsets (58 partitions), 30 rounds
•  8-12hr to partition, 4-10hr for detection (per round)
•  Settings
   •  Minimum Clone Size: 6 lines
   •  No source normalization (execution time)
•  Gold Standard
   •  Amazon EC2 instance with 68GB of RAM
   •  300 billion clone pairs, 11 million cloned fragments

Simian: Cloned Fragment Recall
[Figure: clone fragment recall (0 to 1) versus round (1 to 30). Data labels on the curve: 0.166903883, 0.476927684, 0.626533533, 0.715431474.]
Considering only clone classes with <= 100 fragments.

Simian: Clone Recall (Trim)
[Figure: total recall (0 to 1) versus rounds (1 to 30). Series: Clone Pairs, with linear trend line y = 0.0067x + 0.0533 (R² = 0.99585); Cloned Fragments, with logarithmic trend line y = 0.1364 ln(x) + 0.1199 (R² = 0.95064). Data labels: 0.24792718, 0.619514665.]

NiCad
•  IJaDataset2
•  Scalability: Limited data-structure size.
•  10,000 file subsets, 289 partitions, 20 rounds
•  7-15hr partitioning, 23-31hr detection (per round)

•  Settings:
   •  Clone Size: 10-2500 lines.
   •  Minimum clone similarity: 70%
•  Gold Standard
   •  Not possible.

NiCad – Detection vs. Rounds
[Figure: unique clones found and unique cloned fragments found versus round (1 to 20), plotted on two vertical axes (0 to 1.00E+06 and 0 to 6.00E+06). Linear trend line: y = 245387x + 767852 (R² = 0.99993).]

Deckard
•  IJaDataset
•  Scalability Limit: Execution time.
•  10,000 file subsets, 289 partitions, 20 rounds
•  7-15hr partitioning, 5-7 days detection (per round)
•  Settings:
   •  Minimum Fragment Size: 50 tokens
   •  Sliding Window: 5 tokens
   •  Minimum Clone Similarity: 90% (tree)
•  Gold Standard
   •  Execution time too long.

Deckard: Detection vs. Rounds
[Figure: unique reported clone fragments (1.00E+07 to 1.90E+07) versus round (1 to 10).]

Deckard – Detection vs. Rounds (Trim)
Considering only clone classes with <= 10 fragments.
[Figure: clones and fragments found versus round (1 to 10), plotted on two vertical axes (0 to 1.80E+07 and 0 to 1.20E+08). Axis label: Unique Clone Fragments Found.]

Main Experiment Conclusions
•  Shuffling framework finds cloned fragments faster than the clone pair relationships between them.
•  A large number of rounds may be needed to detect a sizable number of the clone pairs.
•  Appropriate when loss of recall is acceptable.
   •  Ex: contributing towards multi-tool clone corpus.
•  Processing the clones found in an inter-project clone corpus can itself become a scalability issue.

Clone Recovery
How can we improve clone pair discovery?
•  Without a significant increase in rounds?
IDEA: Leverage Cloned Fragment Detection Ability
•  Apply Transitive Property on Clone Repository.
   •  If (A,B) and (B,C) then (A,C)
•  Perform clone search amongst cloned fragments.

Transitive Clone Recovery Test
[Figure: recall (0 to 1) versus round (1 to 30) for NiCad on JDK1.7. Series: Clone Recall, Heuristic Recall, Recovered Recall.]

Transitive Clone Recovery Test
[Figure: recall (0 to 1) versus round (1 to 30) for Simian on JDK1.7. Series: Clone Recall, Heuristic Recall, Recovered Recall.]

Future Work
1.  Investigate additional tools.
2.  Investigate efficient clone recovery methods.
3.  Directly compare with deterministic approach.
4.  Use the shuffling framework to contribute towards an inter-project clone corpus (IJaDataset 2.0).

Thank  You!

Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Scaling classical clone detection tools for ultra large datasets

  • 1. Scaling Classical Clone Detection Tools for Ultra-Large Datasets. Jeffrey Svajlenko, Iman Keivanloo, Chanchal Roy. IWSC 2013
  • 2. Inter-Project Clone Detection • Active research topic in the community. • Goal: Construct inter-project clone corpus. • Applications • Study Global Developer Behavior • Discover Potential APIs and Libraries • Internet-Scale Clone Search • API Recommendation • API Usage Support • …
  • 3. Problem: Inter-Project Detection • Many state-of-the-art tools (classical tools) do not scale to large datasets. • Memory Requirements • Computational Complexity • Execution Time • Underlying limitations in their algorithms or data structures. • Instead, novel scalable techniques are used. • Challenging to develop. • Wish to use tools from a variety of domains when building an inter-project clone corpus.
  • 4. Goal and Motivation • GOAL: To scale classical clone detection tools to ultra-large datasets. • MOTIVATION: To allow classical clone detection tools to contribute to inter-project clone corpuses.
  • 5. Shuffling Framework • Scales classical tools to ultra-large datasets. • Using standard hardware. • Without modifying the original tool. • Incurs a loss of recall. • Method: Non-Deterministic Dataset Partitioning
  • 6. Shuffling Framework - Procedure 1. The source files of the dataset are randomly partitioned into n equally sized subsets. [Diagram: ultra-large dataset split into subsets 1 to 16.] Subset size dictated by clone detection tool's scalability limits.
  • 7. Shuffling Framework - Procedure 2. Each subset is searched independently by the clone detection tool. [Diagram: subsets 1, 2, ..., 16, each passed to its own run of the clone detection tool.]
  • 8. Shuffling Framework - Procedure 3. The detected clone pairs are added to a clone repository. [Diagram: subsets 1 to 16 feeding into the detected clones.]
  • 9. Shuffling Framework - Procedure 4. Steps (1) through (3) are repeated for r rounds. [Diagram: dataset feeding the clone repository over r rounds.] n*r detection experiments.
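The four procedure steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `shuffle_detect`, `toy_detector`, the `subset_size` parameter, and the file names are all hypothetical, and the real framework runs an unmodified external clone detection tool on each subset rather than an in-process function. The sketch fixes the subset size (so the last subset may be smaller) instead of fixing the number of subsets n.

```python
import random
from typing import Callable, Iterable, List, Optional, Set, Tuple

ClonePair = Tuple[str, str]


def shuffle_detect(
    files: List[str],
    subset_size: int,
    rounds: int,
    detect: Callable[[List[str]], Iterable[ClonePair]],
    seed: Optional[int] = None,
) -> Set[ClonePair]:
    """Run `detect` on random partitions of `files` for `rounds` rounds."""
    rng = random.Random(seed)
    repository: Set[ClonePair] = set()  # step 3: the clone repository
    for _ in range(rounds):  # step 4: repeat for r rounds
        shuffled = list(files)
        rng.shuffle(shuffled)  # step 1: random partitioning
        for start in range(0, len(shuffled), subset_size):
            subset = shuffled[start:start + subset_size]
            for a, b in detect(subset):  # step 2: search each subset alone
                # Store each pair in a canonical order so duplicates merge.
                repository.add((a, b) if a <= b else (b, a))
    return repository


def toy_detector(subset: List[str]) -> List[ClonePair]:
    """Hypothetical stand-in for a real tool: files sharing a first
    letter are reported as clones of each other."""
    items = sorted(subset)
    return [
        (a, b)
        for i, a in enumerate(items)
        for b in items[i + 1:]
        if a[0] == b[0]
    ]


if __name__ == "__main__":
    dataset = ["x1", "x2", "x3", "y1", "y2", "z1"]
    found = shuffle_detect(dataset, subset_size=3, rounds=5,
                           detect=toy_detector, seed=1)
    print(sorted(found))
```

A clone pair can only be reported in a round where both of its files land in the same subset, which is why the framework trades recall for scalability and why more rounds recover more pairs.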
  • 10. Shuffling Framework - Evaluation Gold Standard • Clone detection report of a tool executed natively (without shuffling). Total Recall • % of gold standard found after r shuffling rounds of n partitions. • Measure for unique clone pairs or unique cloned fragments.
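The total recall measure above can be written down directly. The following is a sketch, assuming clone pairs are stored as tuples of fragment identifiers in a canonical order; the function names and the example identifiers are hypothetical.

```python
from typing import Set, Tuple

ClonePair = Tuple[str, str]


def fragments(pairs: Set[ClonePair]) -> Set[str]:
    """Unique cloned fragments that appear in at least one clone pair."""
    return {fragment for pair in pairs for fragment in pair}


def total_recall(gold: set, found: set) -> float:
    """Fraction of the gold standard that shuffling has found so far."""
    if not gold:
        raise ValueError("gold standard must not be empty")
    return len(gold & found) / len(gold)


if __name__ == "__main__":
    # Hypothetical native (gold standard) report and shuffled report.
    gold_pairs = {("a", "b"), ("a", "c"), ("b", "c")}
    found_pairs = {("a", "b"), ("b", "c")}
    print(total_recall(gold_pairs, found_pairs))
    print(total_recall(fragments(gold_pairs), fragments(found_pairs)))
```

In this small example every gold-standard fragment has already been seen even though one clone pair is still missing, so fragment recall is higher than pair recall.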
  • 11. Preliminary Study • Test with “regular size” systems: • JHotDraw (20 KLOC, 285 files) • ArgoUML (190 KLOC, 1845 files) • JDK1.7 (900 KLOC, 6916 files) • Tools: • CCFinder, Deckard, iClones, NiCad, SimCad, Simian • Shuffling: 15 subsets, 30 shuffling rounds • Measured: total recall after each round
  • 12. Preliminary Study – JDK1.7 [Plot: recall (0 to 1) vs. round (1 to 30); series: Deckard (1834042), iClones (49716), NiCad (8105), SimCad (549923), Simian (217409); n = 15 subsets, r = 30 rounds.]
  • 13. Preliminary Study • ~60-90% total recall achievable • Shuffling performance varies by detection tool. • Generally, a larger gold standard requires more rounds to get the same total recall.
  • 14. Main Experiment: Dataset IJaDataset 2.0: An Inter-Project Java Corpus • Keivanloo et al., 2012 (Proc. MSR) • Crawled 25,000 Open-Source Java Projects • 3 million Java source files, 356 MLOC • Outliers (>2000 lines) • 6238 removed
  • 15. Experiment - Hardware Clone detection (shuffling): • Workstation-Class Hardware • Quad Core CPU • 12-16GB of RAM • Above Average Disk IO • ~$1000 PC • Allocated on shared cloud resources. • Western Canada Research Grid (Bugaboo Cluster) • Amazon EC2 Instances
  • 16. Experiment - Tools • Simian • NiCad • Deckard • CCFinderX • Terminated without explanation. • SimCad • Execution aborts on troublesome file. • iClones • Compatibility issue.
  • 17. Simian • IJaDataset2 • Scalability limit: RAM • 50,000 file subsets (58 partitions), 30 rounds • 8-12hr to partition, 4-10hr for detection (per round) • Settings • Minimum Clone Size: 6 lines • No source normalization (execution time) • Gold Standard • Amazon EC2 instance with 68GB of RAM • 300 billion clone pairs, 11 million cloned fragments
  • 18. Simian: Cloned Fragment Recall [Plot: clone fragment recall (0 to 1) vs. round (1 to 30); labeled data points: 0.166903883, 0.476927684, 0.626533533, 0.715431474.] Considering only clone classes with <= 100 fragments.
  • 19. Simian: Clone Recall (Trim) [Plot: total recall (0 to 1) vs. rounds (1 to 30); series: Clone Pairs, Cloned Fragments; labeled data points: 0.24792718, 0.619514665; linear fit (Clone Pairs): y = 0.0067x + 0.0533, R² = 0.99585; log fit (Cloned Fragments): y = 0.1364ln(x) + 0.1199, R² = 0.95064.]
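The two trend lines reported on the slide above can be evaluated directly to compare how pair recall and fragment recall grow with the number of rounds. The coefficients are copied from the slide; the function names are hypothetical, and using the fits outside the measured range of 1 to 30 rounds would be an extrapolation the slides do not support.

```python
import math


def pair_recall_fit(rounds: int) -> float:
    """Linear trend reported for clone pairs: y = 0.0067x + 0.0533."""
    return 0.0067 * rounds + 0.0533


def fragment_recall_fit(rounds: int) -> float:
    """Log trend reported for cloned fragments: y = 0.1364 ln(x) + 0.1199."""
    return 0.1364 * math.log(rounds) + 0.1199


if __name__ == "__main__":
    # Tabulate both fits at a few round counts inside the measured range.
    for r in (1, 5, 10, 20, 30):
        print(r, round(pair_recall_fit(r), 4), round(fragment_recall_fit(r), 4))
```

Within the measured range the fragment fit rises quickly and then flattens, while the pair fit grows by a small constant amount per round.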
  • 20. NiCad • IJaDataset2 • Scalability: Limited data-structure size. • 10,000 file subsets, 289 partitions, 20 rounds • 7-15hr partitioning, 23-31hr detection (per round) • Settings: • Clone Size: 10-2500 lines. • Minimum clone similarity: 70% • Gold Standard • Not possible.
  • 21. NiCad – Detection vs. Rounds [Plot: unique clones found and unique cloned fragments found vs. round (1 to 20); series: Unique Clones Found, Unique Clone Fragments Found; linear fit: y = 245387x + 767852, R² = 0.99993.]
  • 22. Deckard • IJaDataset • Scalability Limit: Execution time. • 10,000 file subsets, 289 partitions, 20 rounds • 7-15hr partitioning, 5-7 days detection (per round) • Settings: • Minimum Fragment Size: 50 tokens • Sliding Window: 5 tokens • Minimum Clone Similarity: 90% (tree) • Gold Standard • Execution time too long.
  • 23. Deckard: Detection vs. Rounds [Plot: unique reported clone fragments (1.00E+07 to 1.90E+07) vs. round (1 to 10).]
  • 24. Deckard – Detection vs. Rounds (Trim) Considering only clone classes with <= 10 fragments. [Plot: unique clones and unique clone fragments found vs. round (1 to 10); series: Clones, Fragments.]
  • 25. Main Experiment Conclusions • Shuffling framework finds cloned fragments faster than the clone pair relationships between them. • A large number of rounds may be needed to detect a sizable number of the clone pairs. • Appropriate when loss of recall is acceptable. • Ex: contributing towards a multi-tool clone corpus. • Processing the clones found in an inter-project clone corpus can itself become a scalability issue.
  • 26. Clone Recovery How can we improve clone pair discovery? • Without a significant increase in rounds? IDEA: Leverage Cloned Fragment Detection Ability • Apply Transitive Property on Clone Repository. • If (A,B) and (B,C) then (A,C) • Perform clone search amongst cloned fragments.
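The transitive rule above, "if (A,B) and (B,C) then (A,C)", can be sketched with a union-find structure over the fragments in the clone repository. This is one possible way to apply the rule, not necessarily how the authors implemented it; the function name and fragment labels are hypothetical. Note that similarity-based clone relations are not guaranteed to be transitive, so the extra pairs are best read as candidates.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

ClonePair = Tuple[str, str]


def transitive_recovery(pairs: Iterable[ClonePair]) -> Set[ClonePair]:
    """Return every pair implied by treating `pairs` as transitive."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Find the representative of x's group, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of the two fragments in each detected pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each connected group (a clone class).
    classes: Dict[str, List[str]] = {}
    for fragment in list(parent):
        classes.setdefault(find(fragment), []).append(fragment)

    # Emit all pairs inside each class, in a canonical sorted order.
    return {
        pair
        for members in classes.values()
        for pair in combinations(sorted(members), 2)
    }


if __name__ == "__main__":
    detected = {("A", "B"), ("B", "C"), ("D", "E")}
    print(sorted(transitive_recovery(detected)))
```

The number of emitted pairs grows quadratically with clone class size, which is one reason the plots above consider only clone classes below a fragment-count threshold.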
  • 27. Transitive Clone Recovery Test [Plot: recall (0 to 1) vs. round (1 to 30); series: Clone Recall, Heuristic Recall, Recovered Recall; NiCad, JDK1.7.]
  • 28. Transitive Clone Recovery Test [Plot: recall (0 to 1) vs. round (1 to 30); series: Clone Recall, Heuristic Recall, Recovered Recall; Simian, JDK1.7.]
  • 29. Future Work 1. Investigate additional tools. 2. Investigate efficient clone recovery methods. 3. Directly compare with deterministic approach. 4. Use the shuffling framework to contribute towards an inter-project clone corpus (IJaDataset 2.0).