Large-Scale Distributed Storage System for Business Provenance - Cloud 2011




  1. 1. Large-Scale Distributed Storage System for Business Provenance CLOUD 2011 Szabolcs Rozsnyai, Aleksander Slominski, Yurdaer Doganata
  2. 2. Agenda <ul><li>Introduction and Motivation </li></ul><ul><li>Provenance in the Cloud </li></ul><ul><ul><li>System Overview </li></ul></ul><ul><ul><li>Data Model </li></ul></ul><ul><ul><li>Data Integration </li></ul></ul><ul><ul><li>Indexing </li></ul></ul><ul><ul><li>Querying </li></ul></ul>
  3. 3. Introduction and Motivation 1/2 <ul><li>What is Business Provenance (BP)? </li></ul><ul><ul><li>Monitors enterprise systems to track a consistent and accurate history of processes, including the artifacts, events and relationships involved </li></ul></ul><ul><ul><li>BP enables comprehensive and complete insight into the processes </li></ul></ul><ul><ul><li>BP helps to discover the functional, organizational, data and resource aspects of a business </li></ul></ul><ul><li>Challenges </li></ul><ul><ul><li>The volume and complexity of the data make tracking and processing a difficult and resource-intensive task </li></ul></ul><ul><ul><li>As data grows at a very high rate, tracking arbitrary artifacts for provenance purposes within large organizations is very costly </li></ul></ul><ul><ul><li>Storing, organizing, retrieving and analyzing the artifacts requires allocating a large amount of computing resources </li></ul></ul>
  4. 4. Introduction and Motivation 2/2 <ul><li>Challenges cont. </li></ul><ul><ul><li>With current technologies (RDBMS), trade-offs need to be made between the amount of captured data and the granularity level </li></ul></ul><ul><ul><li>Aggregation is a solution, but it lacks the data that would enable drill-down actions in the context of root-cause analysis (e.g. missing low-level events) </li></ul></ul><ul><ul><li>Fixing the artifact types and the granularity level already implies that certain knowledge is available for the analytics phase </li></ul></ul><ul><ul><ul><li>This might be good enough to satisfy legal requirements or certain compliance applications </li></ul></ul></ul><ul><ul><li>In general, leaving out data reduces the opportunities for better insight and the possibility of gaining new knowledge about the process </li></ul></ul>
  5. 5. What is the problem with traditional technologies? <ul><li>Data Warehousing </li></ul><ul><ul><li>Lacks process knowledge and thus hinders operational insight </li></ul></ul><ul><li>Complex Event Processing (post-event analysis) </li></ul><ul><ul><li>Events can be branched off from continuous streams and stored </li></ul></ul><ul><ul><li>The relationships (i.e. correlations) are preserved </li></ul></ul><ul><ul><li>The information makes it possible to derive insights about the business processes </li></ul></ul><ul><ul><li>Based on RDBMS, for which storing and querying large (web-scale) amounts of data is costly </li></ul></ul><ul><ul><li>Scaling an RDBMS (storage and performance) requires high investments due to specialized hardware and license costs </li></ul></ul><ul><ul><li>The potential benefits do not necessarily justify the investments </li></ul></ul>
  6. 6. Why cloud-based storage? <ul><li>Provides the illusion of infinite computing and data storage resources </li></ul><ul><li>Organizations can increase resources with increasing demands </li></ul><ul><ul><li>Eliminates large up-front commitments (low and incremental costs) </li></ul></ul><ul><ul><li>Can satisfy short-term requirements </li></ul></ul><ul><ul><li>Covers peak loads </li></ul></ul><ul><ul><li>Hot deployment of resources </li></ul></ul><ul><li>Simpler and faster maintenance </li></ul><ul><ul><li>Supports huge datasets and high request rates based on a large number of commodity servers </li></ul></ul><ul><ul><li>Resource capacities can be modified on demand and in a timely manner by adding, changing or removing instances </li></ul></ul><ul><li>Provides a high level of availability, seamless failover and recovery handling across heterogeneous commodity hardware landscapes </li></ul>Characteristics (with the benefits listed above): Scalability, Elasticity, Availability. Trade-off: cloud-based storage sacrifices the complex query capabilities and sophisticated transaction models found in traditional systems
  7. 7. Provenance System Overview
  8. 8. HBase Data Model Characteristics: <ul><li>Tables don’t have a defined schema (i.e. each row of a table can have different attributes) </li></ul><ul><li>Columns are grouped into column families </li></ul><ul><li>Rows are sorted by their row key, and each stored value carries a timestamp </li></ul><ul><li>Everything except the table name is stored as byte[] </li></ul>
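The characteristics above can be pictured as nested sorted maps. The following is a minimal sketch in plain Java, not the HBase client API: the class `TableSketch` and its methods are hypothetical, and row keys are kept as `String` for readability although HBase stores them as `byte[]`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical in-memory model of one schema-less table; not the HBase client API. */
class TableSketch {
    // row key -> "family:qualifier" -> timestamp -> value bytes
    // (String keys are an illustration only; HBase compares raw byte[] keys)
    private final TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> rows = new TreeMap<>();

    /** Stores one cell version; rows may have entirely different columns. */
    void put(String rowKey, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family + ":" + qualifier, k -> new TreeMap<>())
            .put(timestamp, value.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the newest version of a cell as a string, or null if the cell is absent. */
    String getLatest(String rowKey, String family, String qualifier) {
        TreeMap<String, TreeMap<Long, byte[]>> row = rows.get(rowKey);
        if (row == null) return null;
        TreeMap<Long, byte[]> versions = row.get(family + ":" + qualifier);
        if (versions == null) return null;
        return new String(versions.lastEntry().getValue(), StandardCharsets.UTF_8);
    }

    /** Row keys in sorted order, which is what makes range scans possible. */
    List<String> rowKeys() {
        return new ArrayList<>(rows.keySet());
    }
}
```

In this sketch, two rows of the same table can carry different column families and qualifiers, and a second `put` with a higher timestamp on the same cell becomes the value returned by `getLatest`.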
  9. 9. Data Integration The schema-less structure makes it easy to “dump” everything into the data storage, following a LET (Load Extract Transform) paradigm in contrast to classical ETL approaches <ul><li>Get all data independent of its source and type </li></ul><ul><ul><li>You might never know what data you want to analyze at a later point in time </li></ul></ul><ul><ul><li>There is no need to make a compromise here as the storage is relatively cheap </li></ul></ul><ul><ul><li>The space is available </li></ul></ul><ul><ul><li>Performance is preserved through horizontal scaling of the data </li></ul></ul>
  10. 10. Data Indexing <ul><ul><li>Create an inverted index for the extracted property </li></ul></ul><ul><ul><ul><li>IndexTableName: Attributename </li></ul></ul></ul><ul><ul><ul><li>Key: Value + KeyOfRow (John$$b2f59d10-903d-…) </li></ul></ul></ul><ul><ul><ul><li>Value: dummy (not used) </li></ul></ul></ul><ul><ul><li>The reference to KeyOfRow from the indexed table is encoded into the key of the index table so that range scans can be performed. Otherwise the references would have to be stored in the columns of a single row, which would grow extremely large </li></ul></ul>Example index table keys (values unused): Szabolcs_Ref1, Szabolcs_Ref2, Alek_Ref3, …
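The key layout above (indexed value, separator, row key, with an unused value) can be sketched with a sorted map standing in for the index table. This is an illustration under assumptions, not the authors' implementation: `InvertedIndexSketch`, `index` and `lookup` are hypothetical names, and attribute values are assumed not to contain the `$$` separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical inverted index for ONE attribute; a TreeMap stands in for the sorted index table. */
class InvertedIndexSketch {
    static final String SEP = "$$";               // separator taken from the slide's example key
    private static final byte[] DUMMY = new byte[0]; // the value column is not used

    private final TreeMap<String, byte[]> indexTable = new TreeMap<>();

    /** Adds one index entry: key = attribute value + separator + key of the indexed row. */
    void index(String attributeValue, String rowKey) {
        indexTable.put(attributeValue + SEP + rowKey, DUMMY);
    }

    /** Prefix range scan: returns the row keys of all records with the given attribute value. */
    List<String> lookup(String attributeValue) {
        String prefix = attributeValue + SEP;
        List<String> rowKeys = new ArrayList<>();
        for (String key : indexTable.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) break;   // left the range of matching keys
            rowKeys.add(key.substring(prefix.length()));
        }
        return rowKeys;
    }
}
```

Because the row reference is part of the key, every match is its own small index row, and a lookup touches only the contiguous key range that starts with the searched value.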
  11. 11. Composite Indexing <ul><li>Composite indexes allow optimization for fast querying </li></ul><ul><li>Example: </li></ul><ul><ul><li>Search for firstname and lastname </li></ul></ul><ul><ul><li>Composite Index </li></ul></ul><ul><ul><ul><li>Tablename: AttributenameA + AttributenameB </li></ul></ul></ul><ul><ul><ul><li>Key: firstname + lastname + Ref1 </li></ul></ul></ul><ul><ul><ul><li>Value: dummy </li></ul></ul></ul>Example index table keys (values unused): Szabolcs$Rozsnyai_Ref1, Szabolcs$Rozsnyai_Ref2, Alek$Slominski_Ref3, …
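A composite index can be sketched the same way, with both attribute values concatenated in front of the row reference. The class and method names below are hypothetical, and the separators (`$` between attributes, `$$` before the reference) are assumptions chosen for the sketch; values are assumed not to contain `$`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical composite index over (firstname, lastname); a TreeSet stands in for the sorted keys. */
class CompositeIndexSketch {
    static final String ATTR_SEP = "$";  // between the two attribute values (assumption)
    static final String REF_SEP = "$$";  // before the row reference (assumption)

    private final TreeSet<String> keys = new TreeSet<>();

    /** Adds one entry: firstname + "$" + lastname + "$$" + row key. */
    void index(String firstName, String lastName, String rowKey) {
        keys.add(firstName + ATTR_SEP + lastName + REF_SEP + rowKey);
    }

    /** AND search on both attributes with a single prefix scan. */
    List<String> searchFirstAndLast(String firstName, String lastName) {
        return scan(firstName + ATTR_SEP + lastName + REF_SEP);
    }

    /** Search on the leading attribute only; the same index still applies. */
    List<String> searchFirst(String firstName) {
        return scan(firstName + ATTR_SEP);
    }

    private List<String> scan(String prefix) {
        List<String> rowKeys = new ArrayList<>();
        for (String key : keys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) break;  // past the matching key range
            rowKeys.add(key.substring(key.lastIndexOf(REF_SEP) + REF_SEP.length()));
        }
        return rowKeys;
    }
}
```

The design choice mirrors composite indexes in relational systems: the order of the attributes in the key decides which searches can be answered by one scan, so a search on lastname alone would not benefit from this index.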
  12. 12. Querying <ul><li>Bad News </li></ul><ul><ul><li>Simple key lookups to retrieve values are easy to realize, but there is no declarative query language or any means to express more sophisticated constructs such as joins </li></ul></ul><ul><ul><ul><li>No optimizations on declarative queries </li></ul></ul></ul><ul><ul><li>Queries often require set operations (such as intersections) </li></ul></ul><ul><ul><ul><li>There is no out-of-the-box facility or algorithm that deals with efficient memory usage, for instance </li></ul></ul></ul><ul><ul><ul><li>SQL query algorithms need to be “re-implemented” </li></ul></ul></ul><ul><li>Good News </li></ul><ul><ul><li>Applications (such as Provenance) have a well-defined set of (parameterized) queries </li></ul></ul><ul><ul><li>Most key-value store implementations store keys in sorted order and support range scans on keys with paging (prefix, startkey, page size, …) </li></ul></ul><ul><ul><ul><li>Otherwise we would need to break the list into pieces (e.g. a file inode-like structure) </li></ul></ul></ul>
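The range scan with paging mentioned above (prefix, start key, page size) can be sketched over any sorted key set. This is a plain-Java illustration, not a key-value store API; `PagedScanSketch` and `scanPage` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

/** Hypothetical paged prefix scan over a sorted key set. */
class PagedScanSketch {
    /**
     * Returns up to pageSize keys that start with prefix.
     * startAfter is the last key of the previous page, or null for the first page.
     */
    static List<String> scanPage(NavigableSet<String> sortedKeys, String prefix,
                                 String startAfter, int pageSize) {
        // Resume strictly after the previous page's last key; otherwise start at the prefix.
        boolean resume = startAfter != null && startAfter.compareTo(prefix) >= 0;
        NavigableSet<String> tail = resume
                ? sortedKeys.tailSet(startAfter, false)
                : sortedKeys.tailSet(prefix, true);
        List<String> page = new ArrayList<>();
        for (String key : tail) {
            if (page.size() == pageSize || !key.startsWith(prefix)) break;
            page.add(key);
        }
        return page;
    }
}
```

A caller would request the first page with a null start key and pass the last key of each page into the next call until an empty page comes back, so only one page of references is held in memory at a time.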
  13. 13. Querying <ul><li>Simple Queries: </li></ul><ul><ul><ul><li>Search By Attribute </li></ul></ul></ul><ul><ul><ul><li>Boolean Search </li></ul></ul></ul><ul><li>Filter Query </li></ul><ul><li>Traversing graphs of Relationships </li></ul>
  14. 14. Querying
      Search By Attribute <ul><li>Returns all rows where the specified attribute matches the given search value </li></ul>
      List<PStoreRecord> recordRetList = recordDAO.searchByAttribute("person:firstname", "Szabolcs");
      Boolean Search <ul><li>Allows several search value lookups to be combined </li></ul><ul><li>If a composite index has been defined for the three attributes in the example, the implementation has to perform only one lookup in the index </li></ul>
      // create searchTerms
      HashMap<String, String> searchTerms = new HashMap<String, String>();
      searchTerms.put("person:firstName", "Michael1");
      searchTerms.put("person:lastName", "Smith1");
      searchTerms.put("person:userId", "msmith1");
      List<PStoreRecord> resultRecordList = recordDAO.searchBooleanOperator(searchTerms, HBasePStoreRecordDAO.AND_OPERATOR);
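When no composite index exists, an AND search needs one index lookup per term followed by an intersection of the returned row keys. The sketch below shows one memory-friendly way to do that, assuming each lookup returns its row keys in ascending order (as a range scan over a sorted index would). `BooleanAndSketch` and its methods are hypothetical and are not part of the `recordDAO` API shown above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical AND combination of per-term index lookups via sorted-list intersection. */
class BooleanAndSketch {
    /** Intersects two ascending-sorted row-key lists in a single pass, without building hash sets. */
    static List<String> intersectSorted(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {          // row key present in both lists
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                i++;                 // advance the list with the smaller key
            } else {
                j++;
            }
        }
        return out;
    }

    /** Combines the results of several term lookups with AND semantics. */
    static List<String> andAll(List<List<String>> perTermRowKeys) {
        if (perTermRowKeys.isEmpty()) return new ArrayList<>();
        List<String> result = perTermRowKeys.get(0);
        for (int k = 1; k < perTermRowKeys.size(); k++) {
            result = intersectSorted(result, perTermRowKeys.get(k));
        }
        return result;
    }
}
```

With a composite index over all three attributes this intersection step disappears, which is the saving the slide points out.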
  15. 15. Filter Operator <ul><li>Performs joins over several “relations” and can be used to represent (correlation) rules from Provenance </li></ul><ul><li>Example: </li></ul><ul><li>WHERE </li></ul><ul><li>OrderReceived.userId = "srozsnyai213" AND </li></ul><ul><li>OrderReceived.orderId = ShipmentCreated.orderId AND </li></ul><ul><li>ShipmentCreated.shipmentId = TransportStarted.shipmentId AND </li></ul><ul><li>TransportStarted.TransportId = TransportEnded.TransportId AND </li></ul><ul><li>OrderReceived.Type = "OrderReceived" AND </li></ul><ul><li>ShipmentCreated.Type = "ShipmentCreated" AND </li></ul><ul><li>TransportStarted.Type = "TransportStarted" AND </li></ul><ul><li>TransportEnded.Type = "TransportEnded" </li></ul>
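One way to read the WHERE clause above is as a chain of lookups, where each join condition feeds the value found in one record into the search for the next. The sketch below is an assumption about how such a filter could be evaluated, not the authors' implementation: `FilterJoinSketch` is hypothetical, records are plain maps, and the linear `find` method stands in for an index lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical evaluation of the example filter as a chain of nested lookups. */
class FilterJoinSketch {
    /** Stand-in for an index lookup: records of the given type whose attribute equals value. */
    static List<Map<String, String>> find(List<Map<String, String>> store,
                                          String type, String attr, String value) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> record : store) {
            if (type.equals(record.get("Type")) && value != null && value.equals(record.get(attr))) {
                hits.add(record);
            }
        }
        return hits;
    }

    /** Follows OrderReceived -> ShipmentCreated -> TransportStarted -> TransportEnded for one user. */
    static List<List<Map<String, String>>> orderChains(List<Map<String, String>> store, String userId) {
        List<List<Map<String, String>>> chains = new ArrayList<>();
        for (Map<String, String> order : find(store, "OrderReceived", "userId", userId)) {
            for (Map<String, String> shipment : find(store, "ShipmentCreated", "orderId", order.get("orderId"))) {
                for (Map<String, String> started : find(store, "TransportStarted", "shipmentId", shipment.get("shipmentId"))) {
                    for (Map<String, String> ended : find(store, "TransportEnded", "TransportId", started.get("TransportId"))) {
                        chains.add(List.of(order, shipment, started, ended));
                    }
                }
            }
        }
        return chains;
    }
}
```

Each element of the result is one complete chain of four correlated events; an order without a matching shipment or transport simply contributes no chain.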
  16. 16. Some Evaluation A process simulator for an export compliance regulation use case. It covers a wide range of heterogeneous systems (Order Management, Document Management, E-Mail, Export Violation Detection Services, …) as well as workflow-supported, human-driven interactions (Process Management System). All of these systems generate a wide range of events at different granularity levels, which allows us to create a comprehensive graph of relationships. Storage operations per provenance operation (number of operations, type of operation): <ul><li>Inserting Record: 1 per record (Write) </li></ul><ul><li>Inverted Indexing: 1 per attribute per record (Write) </li></ul><ul><li>Composite Indexing: 1 per attribute per record (Write) </li></ul><ul><li>Search By Attribute: 1 per search (Scan with prefix filter), plus 1 per reference retrieved from the index (Read) </li></ul><ul><li>Boolean Search w. Composite Index: 1 per search (Scan with prefix filter), plus 1 per reference retrieved from the index (Read) </li></ul><ul><li>Boolean Search w.o. Composite Index: 1 per sub-expression connected with a boolean operator in a search (Scan with prefix filter), plus 1 per reference retrieved from the index for one expression (Read) </li></ul><ul><li>Filter Query: a Boolean Search is executed for each sub-expression with a join, and a Search By Attribute for the rest </li></ul>
  17. 17. Future and Ongoing Work <ul><li>(Distributed) Business Process Analytics </li></ul><ul><ul><li>Correlation Discovery </li></ul></ul><ul><ul><li>Process Mining </li></ul></ul><ul><ul><li>Predictive Analytics </li></ul></ul><ul><ul><li>Improve query expressiveness (Hive, Pig, …) </li></ul></ul>