Reference Representation in Large Metamodel-based Datasets

This presentation was held at the BigMDE Workshop (at STAF) in Budapest, 2013

Transcript

  • 1. Markus Scheidgen: Model representations for large meta-model-based data-sets ■ Introduction: Technological spaces and model representations ■ Comparison of representations ■ Implementation ■ Application
  • 2. Introduction: Technological Spaces [Diagram. Spaces: Software Models; Code; XML; Databases; Objects (e.g. POJOs); running programs; other data. Edge labels: reverse engineering; code generation; persistence / exchange; persistence / versioning; processing (via ORMs, e.g. JPA); debugging / profiling; reflection; runtime modeling; processing (e.g. DOM/JAXB); exchange (e.g. in web services); XSLT/XSL/XQuery/XPath; model transformation / constraints / queries; static analysis / compilation / refactoring; SQL]
  • 3. Introduction: State of the Art [Diagram labels: Meta-Models, Models; Schemas, XML; Grammars, Code; Classes, Objects; ER-Schemas, Relational Data; *. Uses: visualization and editing by human users; processing in computer programs; exchange; large data-sets / persistence and querying]
  • 4. Introduction: New Class of DBMS [Diagram labels: Meta-Models, Models; Schemas, XML; Grammars, Code; Classes, Objects; ER-Schemas, Relational Data; *; Big Data; Graphs; ER-Schemas, Big Relational Data; ?]
  • 5. Representation: Strategies [Table. Objects: object-by-object or fragments. References: part-of-source or relations. Part-of-source, object-by-object: Morsa. Part-of-source, fragments: (Java) XMI, EMF-Frag. Relations, object-by-object: CDO. Relations, fragments: ?]
  • 6. Representation: Object-by-object vs. Fragmentation (considering traversal, theoretical results) [Plot: execution time t (in ms) over number of loaded objects l and fragment size f, logarithmic axes; curves: no fragmentation (f=m), optimal fragmentation, total fragmentation (f=1)]
  • 7. Representation: Object-by-object vs. Fragmentation (considering traversal, theoretical results vs. implementation) [Two plots: execution time t (in ms) over number of loaded objects l and fragment size f, logarithmic axes. Theoretical: no fragmentation (f=m), optimal fragmentation, total fragmentation (f=1). Implementation: optimal fragmentation]
  • 8. Representation: Object-by-object vs. Fragmentation (considering traversal, implementation with actual model) ■ Model traversal of Grabats models with four different sizes and different characteristics [Plots over set0 to set4. Objects per second (10^4): XMI, CDO, Morsa, EMFFrag coarse, EMFFrag fine; some values not measured but extrapolated. Number of fragments (logarithmic): CDO/Morsa, EMFFrag coarse, EMFFrag fine]
  • 9. Representation: Object-by-object vs. Fragmentation (considering query, implementation with actual model) ■ Query of Grabats models with four different sizes and different characteristics [Plots over set0 to set4. Execution time (in s): XMI, CDO w/o SQL, CDO, Morsa w/o index, Morsa, EMFFrag coarse, EMFFrag fine; some values not measured but extrapolated. Number of fragments (logarithmic): CDO/Morsa, EMFFrag coarse, EMFFrag fine]
  • 10. Representation: Part-of-source vs. Relations (real implementation, artificial model) [Two plots: execution time in ms over number of outgoing references, logarithmic axes. Panels: part-of-source implementation; relation implementation with individual access. Curves in each panel: access of one outgoing reference; traversal of all outgoing references]
  • 11. Representation: Part-of-source vs. Relations (real implementation, artificial model) [Two plots: execution time in ms over number of outgoing references, logarithmic axes. Panels: part-of-source implementation; relation implementation with scanning. Curves in each panel: access of one outgoing reference; traversal of all outgoing references]
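The three reference representations compared on slides 10 and 11 can be illustrated with a toy in-memory model. This is only a sketch under assumptions: the class names, the `(source, index)` key layout, and the `lookups` / `entries_read` counters are hypothetical and do not come from the slides or from EMF-Fragments, and the counters are a crude cost model, not the measured execution times.

```python
from bisect import bisect_left


class PartOfSourceStore:
    """References are stored together with their source object (one value per source)."""

    def __init__(self):
        self.kv = {}
        self.lookups = 0
        self.entries_read = 0

    def put(self, source, targets):
        self.kv[source] = list(targets)

    def _load(self, source):
        # Loading the source always brings in all of its outgoing references.
        self.lookups += 1
        targets = self.kv[source]
        self.entries_read += len(targets)
        return targets

    def get_one(self, source, index):
        return self._load(source)[index]

    def get_all(self, source):
        return list(self._load(source))


class RelationStore:
    """Each reference is its own entry, keyed by (source, index)."""

    def __init__(self):
        self.kv = {}
        self.sorted_keys = []
        self.lookups = 0
        self.entries_read = 0

    def put(self, source, targets):
        for index, target in enumerate(targets):
            self.kv[(source, index)] = target
        self.sorted_keys = sorted(self.kv)

    def get_one(self, source, index):
        # Individual access: one lookup per reference.
        self.lookups += 1
        self.entries_read += 1
        return self.kv[(source, index)]

    def get_all_individual(self, source, count):
        return [self.get_one(source, i) for i in range(count)]

    def get_all_scan(self, source):
        # Scanning: one range scan over the sorted keys of this source.
        self.lookups += 1
        start = bisect_left(self.sorted_keys, (source, 0))
        result = []
        for key in self.sorted_keys[start:]:
            if key[0] != source:
                break
            result.append(self.kv[key])
        self.entries_read += len(result)
        return result


targets = [f"t{i}" for i in range(1000)]
pos = PartOfSourceStore()
pos.put("src", targets)
rel = RelationStore()
rel.put("other", ["q"])
rel.put("src", targets)

pos.get_one("src", 5)              # 1 lookup, but all 1000 entries are read
rel.get_one("src", 5)              # 1 lookup, 1 entry read
scanned = rel.get_all_scan("src")  # 1 more lookup, 1000 entries read
```

In this toy model, accessing a single reference is cheap in the relation store and expensive in the part-of-source store once a source has many outgoing references, while a full traversal needs either one lookup per reference (individual access) or a single range scan.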
  • 12. Implementation: EMF-Fragments [Layer diagram, numbered 1 to 4: “Share Nothing” nodes (cluster, ad hoc network); DFS (HDFS); key-value store (HBase); map/reduce (Hadoop); structured data; data-sets; applications; meta-model; structured data model; transformations]
  • 13. Implementation: Datastore mapping [Diagram labels: regular containment; metamodel; part-of-source fragmentation; relation-based fragmentation]
  • 14. Implementation: Meta-model-based declaration of representations [Class diagram: Project, Package, CompilationUnit, Class, Field, Method, Call; references with multiplicity *; three references annotated «fragments», one annotated «relation»]
  • 15. Implementation: Architecture [Class diagram: FragmentedModel extends Resource; Fragment extends Resource; FObject extends EObject; FInternalObject extends DynamicEObject; FStore extends EStore; FValueSetList; ResourceSet; URIHandler; DataStore; EList; EObjectEList. Legend: visible API; EMF-Fragments classes; regular EMF classes]
  • 16. Applications: Mining and Analyzing Software Repositories ■ Software repositories contain more information than the current software code: ■ “developers who changed class/method/statement X also changed class/method/statement Y” ■ this information leads to knowledge about dependencies that cannot be determined through static or even dynamic analysis ■ this can be used to • predict/find bugs • understand/improve the code-base ■ dependency information should be stored as relational data ■ When a piece of software evolves, its metrics change. Such dynamic metrics describe software better than static code metrics. This could lead to a better assessment of methodologies or a better understanding of software engineering in general.
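The “developers who changed X also changed Y” idea on slide 16 can be sketched as a co-change count over commits. This is a minimal illustration, not the author's implementation: the function name `co_change_counts`, the file names, and the commit list are made up, and a real analysis would work on classes, methods, or statements extracted from the repository rather than on file names.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each unordered pair of elements changed in the same commit."""
    counts = Counter()
    for changed in commits:
        # Sort so that each pair is always counted under the same key.
        for a, b in combinations(sorted(set(changed)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical history: each commit lists the elements it changed.
commits = [
    ["Parser.java", "Lexer.java"],
    ["Parser.java", "Lexer.java", "Ast.java"],
    ["Ast.java", "Printer.java"],
]

counts = co_change_counts(commits)
strongest_pair, times = counts.most_common(1)[0]
```

Each `(a, b)` pair with its count is naturally a row of relational data, which fits the slide's remark that dependency information should be stored relationally.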
  • 17. Applications: Mining and Analyzing Software Repositories ■ JGit: Java implementation of the Git version control system ■ MoDisco: reverse engineering framework for Eclipse Java projects, based on EMF ■ EMF-Compare: determines matches and differences between models ■ EMF-Fragments: my own framework for large models ■ over 300 Git repositories with Eclipse plug-ins that constitute the whole Eclipse Foundation source base, as “example” data-set
  • 18. Applications: Model of a Software Repository [Diagram. Model level: revisions B1.R1, B1.R2, B1.R3, B1.R4, B2.R1, B2.R2 with elements labelled A, B, C, D, P. Metamodel level: Repository, Revision (prev/next), Diff, Compilation Unit Model, Package, Class, ..., Package Access; references usageIn, package, extends; parts labelled JGit and MoDisco; annotations «relation, fragmentation», «fragmentation», «relation»]
  • 19. Summary ■ Choosing the right representation makes a difference ■ Meta-model-based declaration of representations works (might not be good enough) ■ There are applications that can benefit from different representations [Table. Objects: object-by-object or fragments. References: part-of-source or relations. Part-of-source, object-by-object: Morsa. Part-of-source, fragments: (Java) XMI, EMF-Frag. Relations, object-by-object: CDO. Relations, fragments: ?]
  • 20. Backup
  • 21. Possible Approaches: Different Target Platforms [Diagram labels: Schemas, XML; XMI; XMI+Resources; ORM; ER-Schemas, Relational Data: ACID, structured data; ER-Schemas, Big Relational Data: BASE, structured data; Big * (Big Data, Graphs): BASE, structured data; ORM? ²; BASE; CAP theorem ¹] ¹ Eric A. Brewer: Towards robust distributed systems; 19th ACM Symposium on Principles of Distributed Computing, 2000. ² K. Barmpis and D.S. Kolovos: Comparative Analysis of Data Persistence Technologies for Large-Scale Models; XM 2012.
  • 22. Possible Approaches: Different Types of Mapping [Diagram labels: per-object mapping; fragmentation; ER-Schemas, Relational Data; ER-Schemas, Big Relational Data; Big *. Property sets shown: fast query, slow traversal, slow entry (fine transactions), given twice, once marked ¹; slow query, fast traversal, fast entry (coarse transactions)] ¹ Javier Espinazo-Pagán, Jesús Sánchez Cuadrado, Jesús García Molina: Morsa, A Scalable Approach for Persisting and Accessing Large Models; MoDELS 2011.
  • 23. Fragmentation: Types of references ■ organizing large artifacts in different resources is already implemented in EMF ■ resources are loaded if necessary; objects in unloaded resources are represented by proxy objects ■ objects in different resources (as all related objects) are related through references; therefore models are fragmented along references ■ EMF-Fragments automatically fragments large models based on annotations in the meta-model ■ resources are identified via URIs and can be serialized (e.g. as XMI); therefore resources can be stored in a key-value store
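The fragmentation scheme on slide 23 can be sketched as follows. This is a simplified illustration, not the EMF-Fragments API: the model is a plain nested dictionary, JSON stands in for XMI, a Python `dict` stands in for the key-value store, and the URI scheme, the `proxy` marker, and the set `FRAGMENTING_FEATURES` are assumptions made for the example.

```python
import json

# Hypothetical containment features annotated «fragments» in the meta-model.
FRAGMENTING_FEATURES = {"packages", "units"}


def store_fragment(obj, uri, store):
    """Serialize obj under store[uri]; children behind fragmenting features get their own entry."""

    def encode(node, path):
        encoded = {"type": node["type"], "name": node["name"]}
        for feature, children in node.get("contains", {}).items():
            items = []
            for index, child in enumerate(children):
                child_uri = f"{path}/{feature}.{index}"
                if feature in FRAGMENTING_FEATURES:
                    store_fragment(child, child_uri, store)  # child becomes its own fragment
                    items.append({"proxy": child_uri})       # parent keeps only a proxy
                else:
                    items.append(encode(child, child_uri))   # child stays in this fragment
            encoded[feature] = items
        return encoded

    store[uri] = json.dumps(encode(obj, uri))


def load_fragment(uri, store):
    """Load one fragment on demand; proxies inside it are resolved by further calls."""
    return json.loads(store[uri])


model = {
    "type": "Project", "name": "demo",
    "contains": {"packages": [{
        "type": "Package", "name": "core",
        "contains": {"units": [{
            "type": "CompilationUnit", "name": "Foo.java",
            "contains": {"classes": [{"type": "Class", "name": "Foo"}]},
        }]},
    }]},
}

store = {}  # stands in for a key-value store
store_fragment(model, "model", store)
root = load_fragment("model", store)
unit = load_fragment("model/packages.0/units.0", store)
```

Loading the root fragment only yields a proxy for the package, so the compilation unit and its class are read from the store only when that proxy is followed.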
  • 24. Fragmentation: Types of references [Diagram labels: normal references (*); «fragments» fragmenting references (*); large value sets (*)]
  • 25. Applications ■ HWL sensor and network operation data (or experiment data in general) ■ real-time persistence required ➜ fast data entry ■ hierarchically structured data (different sensors and other data sources) ➜ meta-modeling ■ queries for experiments, sensors, specific time periods ➜ only coarse, simple queries ■ traversal of larger sub-trees, mostly applications based on data aggregation ■ actual demand for big data depends on the size of the sensor network ➜ scalability ■ CityGML models (or geo-spatial data in general) ■ standardized as XML schemas ➜ XML-based data ■ special proprietary indexes (e.g. spatial indexes like R-trees) and corresponding queries ■ rather query-intensive applications ■ actual demand for big data depends on the level of detail (LOD) of the models ➜ scalability ■ Software Engineering ■ Code/Model Version Control ■ Mining Software Repositories (MSR) ■ revisions of ASTs and differences between ASTs ➜ existing meta-model-based frameworks (e.g. designed for reverse engineering purposes) ■ large number of revisions causes many large value sets ■ queries for revisions, compilation units ➜ rather coarse queries ■ aggregations and statistics ➜ can be expressed in an OCL-like language ■ immediate demand for processing in (at least smaller) clusters ■ has to be mixed with relational data for some applications
  • 26. Applications: Scientific Data [Diagram labels: WSN; XML documents; click; xml-to-model; text-to-model]
  • 27. Applications: CityGML ■ XML-based standard ➜ meta-models can be generated (1-to-1 mapping) ■ different standards define XML schemas that extend each other: GML⇽CityGML⇽extensions ■ transparent use of spatial indexes ■ map onto existing platforms (e.g. SpatialHadoop) ■ use existing implementations and persist into the key-value store ■ extensions to CityGML can be used to reference CityGML models as spatial context for sensor data
  • 28. Backup
  • 29. Research Overview [Diagram labels: WIRELESS SENSOR NETWORKS; DATA ANALYSIS FRAMEWORK; GEO INFORMATION SYSTEMS; sensor data; heterogeneous networks; mesh networks; cellular networks; spatial data; regular databases; spatial databases; distributed data stores; distributed analysis; data homogenisation; domain-specific analysis languages]
  • 30. HWL: Commodity Hardware
  • 31. 31
  • 32. ‣ 120+ nodes ‣ indoor and outdoor ‣ dense and sparse ‣ short and long links ‣ stationary and mobile nodes
  • 33. ‣ 120+ nodes ‣ indoor and outdoor ‣ dense and sparse ‣ short and long links ‣ stationary and mobile nodes
  • 34. Experiments: The Test Site (Markus Scheidgen: HWL – A High-Performance Wireless Sensor Research Network) § simplest case: two-lane, newly paved road § nodes spatially equally distributed on both sides of the road § 2x5 nodes § homogeneous test-bed: same nodes, equally calibrated, same stone ground § one camera to record control data [Site sketch: nodes 1 to 10; stone; distance labels “? m” and “10 m”; directions: towards Groß-Berliner Damm, towards the institute]
  • 35. Experiments: Example Data [Two plots. Amplitudes: time signal of all 3 channels, accelerator value over time sample (1/400 sec), channels X, Y, Z. Frequencies: single-sided amplitude spectrum, |Y(fr)| over frequency (Hz), channels X, Y, Z]
  • 36. Experiment: Algorithm § Similar to earthquake detection: comparison of short and long moving averages ($S = 0.2\,\mathrm{s}$, $L = 4\,\mathrm{s}$): $s_x = x\text{th acceleration value}$ (1); $\mathrm{mavg}(s_x, W) = \frac{\sum_{i=x-W}^{x} s_i}{W}$ (2); $\hat{s}_x = \lvert s_x - \mathrm{avg}(s_x, L) \rvert$ (3); $w^S_x = \mathrm{mavg}(\hat{s}_x, S)$ (4); $w^L_x = \mathrm{mavg}(\hat{s}_x, L)$ (5); $w = \frac{w^S_x}{w^L_x}$ (6)
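The six equations on slide 36 can be sketched in a few lines. This is a hedged reading of the slide, not the original HWL code: it assumes that avg in equation (3) is the same moving average mavg of equation (2), that windows are given in samples rather than seconds (at the 400 Hz rate shown on the example-data slide, S = 0.2 s would be 80 samples and L = 4 s would be 1600), and that the ratio is left at 0.0 during a warm-up phase of two long windows. The names `mavg`, `detection_ratio`, `short_w`, and `long_w` are chosen for this example.

```python
def mavg(s, x, w):
    """Eq. (2): sum of s[x-w] .. s[x], divided by w. Requires x >= w."""
    return sum(s[x - w:x + 1]) / w


def detection_ratio(samples, short_w, long_w):
    """Return the ratio w of eq. (6) per sample; 0.0 during the warm-up phase."""
    n = len(samples)

    # Eq. (3): absolute deviation from the long moving average (avg read as mavg).
    s_hat = [0.0] * n
    for x in range(long_w, n):
        s_hat[x] = abs(samples[x] - mavg(samples, x, long_w))

    ratios = [0.0] * n
    for x in range(2 * long_w, n):
        w_short = mavg(s_hat, x, short_w)  # eq. (4)
        w_long = mavg(s_hat, x, long_w)    # eq. (5)
        ratios[x] = w_short / w_long if w_long > 0 else 0.0  # eq. (6)
    return ratios


# Synthetic signal: low-amplitude noise with a strong burst at samples 150..159.
signal = []
for i in range(200):
    sign = 1 if i % 2 == 0 else -1
    amplitude = 1.0 if 150 <= i < 160 else 0.01
    signal.append(sign * amplitude)

ratios = detection_ratio(signal, short_w=4, long_w=40)
peak_index = max(range(len(ratios)), key=ratios.__getitem__)
```

In the quiet part of the signal the short and long averages are of similar size, so the ratio stays close to 1; when the burst starts, the short average reacts first and the ratio rises sharply, which is what a threshold on w would detect.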
  • 37. Data Management
  • 38. Research Overview [Diagram labels: WIRELESS SENSOR NETWORKS; DATA ANALYSIS FRAMEWORK; GEO INFORMATION SYSTEMS; sensor data; heterogeneous networks; mesh networks; cellular networks; spatial data; regular databases; spatial databases; distributed data stores; distributed analysis; data homogenisation; domain-specific analysis languages]
  • 39. Technological Infrastructure [Diagram labels: internet; cellular; cellular; wifi; zigbee; zigbee]
  • 40. Logical Infrastructure actions visualization sensors information
  • 41. [Diagram labels: internet; cellular; cellular; wifi; zigbee; zigbee. Axis: generic / domain-specific. Stack labels: CPUs; machine code; radios; network protocols; hard drives; programming languages; processes; algorithms; data representation; databases; distributed programming models; information/knowledge; software engineering; DSL]
  • 42. Complex Data Types ➡ complex data structures ➡ lots of links between data objects ➡ evolving structures ➡ requires a type-safe programming environment that promotes reuse
  • 43. Large Amounts of Data ➡ a certain amount of data needs to be stored per second (HWL: 120 nodes): ~140×10^3 data objects per second, ~7 MB/s serialized ➡ a certain amount of data needs to be stored all together (24 h): ~12×10^9 data objects, ~600 GB serialized ➡ Data analysis must complete in reasonable time; for live applications, in real time.
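The daily totals on slide 43 follow directly from the per-second rates. The short check below redoes that arithmetic; it assumes 1 MB = 10^6 bytes and 1 GB = 10^9 bytes, and the derived per-object and per-node figures are simple averages that the slide itself does not state.

```python
OBJECTS_PER_SECOND = 140_000     # ~140×10^3 data objects per second (from the slide)
BYTES_PER_SECOND = 7_000_000     # ~7 MB/s serialized, assuming 1 MB = 10^6 bytes
NODES = 120                      # HWL network size (from the slide)
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 s

objects_per_day = OBJECTS_PER_SECOND * SECONDS_PER_DAY  # about 12×10^9
bytes_per_day = BYTES_PER_SECOND * SECONDS_PER_DAY      # about 600 GB

# Derived averages (not stated on the slide).
bytes_per_object = BYTES_PER_SECOND / OBJECTS_PER_SECOND
objects_per_node_per_second = OBJECTS_PER_SECOND / NODES

print(f"{objects_per_day:.3e} objects per day")
print(f"{bytes_per_day / 1e9:.1f} GB per day")
print(f"{bytes_per_object:.0f} bytes per object on average")
print(f"{objects_per_node_per_second:.0f} objects per node per second on average")
```

The results, roughly 1.2×10^10 objects and 605 GB per day, agree with the rounded figures on the slide.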
  • 44. From Click to ClickWatch [Diagram labels: Click API; software; Element (three times); Compound; Handler (twice); NetworkInterface]
  • 45. Complex Data Types: Meta-Modeling This [ ] happens all the time in software modeling: state charts, class diagrams, MSCs, OCL (context Foo self.properties->foreach(a|a.x != a.y)), Eclipse Modeling Framework (EMF) ➡ Distributed storage and links between different types of data are only a simple extension of existing technology: multi-resource persistence is already implemented
  • 46. Large Amounts of Data: Problem Statement [Layer diagram, generic to domain-specific: “Share Nothing” nodes (cluster, ad hoc network); DFS (HDFS); key-value store ¹ (HBase); map/reduce ² (Hadoop); hierarchical data (XML, OGC standards); data series (sensor data); signal analysis, statistics, sensor fusion] 1. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A distributed storage system for structured data (awarded best paper!). In Brian N. Bershad and Jeffrey C. Mogul, editors, OSDI, pages 205–218. USENIX Association, 2006. 2. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150. USENIX Association, 2004.
  • 47. Large Amounts of Data: Approach [Layer diagram, numbered 1 to 4: “Share Nothing” nodes (cluster, ad hoc network); DFS (HDFS); key-value store (HBase); map/reduce (Hadoop); hierarchical data (XML, OGC standards); data series (sensor data); signal analysis, statistics, sensor fusion; meta-model; structured data model; transformations]
