Data model for analysis of scholarly documents in the MapReduce paradigm

1. Data model for analysis of scholarly documents in the MapReduce paradigm
   Adam Kawa, Lukasz Bolikowski, Artur Czeczko, Piotr Jan Dendek, Dominika Tkaczyk
   Centre for Open Science (CeON), ICM UW
   Warsaw, July 6, 2012
2. Agenda
   1. Problem definition
   2. Requirements specification
   3. Exemplary solutions based on Apache Hadoop ecosystem tools
3. The data in our possession
   Vast collections of scholarly documents to store:
   - 10 million full texts (PDF, plain text)
   - 17 million document metadata records (described in the XML-based BWMeta format)
   - 4 TB of data (10 TB including data archives)
4. The tasks we are working on
   Big knowledge to extract and discover:
   - 17 million document metadata records (XML) contain title, subtitles, abstract, keywords, references, contributors and their affiliations, publishing journal, ...
   - they are input for many state-of-the-art machine learning algorithms
   - relatively simple tasks: searching for documents with a given title, finding scientific teams, ...
   - quite complex tasks: author name disambiguation, bibliometrics, classification code assignment, ...
5. The requirements we have specified
   Multiple demands regarding storage and processing of large amounts of data:
   - scalability and parallelism: easily handle tens of terabytes of data and parallelize the computation effectively
   - flexible data model: possibility to add or update data, and to enhance its content with implicit information discovered by our algorithms
   - latency requirements: support batch offline processing as well as random, real-time read/write requests
   - availability of many clients: accessible to programmers and researchers with diverse language preferences and expertise
   - reliability and cost-effectiveness: ideally open-source software that does not require expensive hardware
6. Document-related data as linked data
   Information about document-related resources can be naturally described as a directed labeled graph:
   - entities (e.g. documents, contributors, references) are nodes in the graph
   - relationships between entities are directed, labeled edges in the graph
7. Linked graph as a collection of RDF triples
   A directed labeled graph can be simply represented as a collection of RDF triples (a minimal sketch follows below):
   - a triple consists of a subject, a predicate and an object
   - a triple represents a statement which denotes that a resource (the subject) holds a value (the object) for some attribute (the predicate) of that resource
   - a triple can represent any statement about any resource
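To make the triple model concrete, here is a minimal sketch in Java; the Triple class and the sample identifiers (doc:123, person:kawa, the dc: predicates) are illustrative assumptions, not part of the original deck:

```java
/** A minimal RDF-style triple: (subject, predicate, object). */
public final class Triple {
    public final String subject;
    public final String predicate;
    public final String object;

    public Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    public static void main(String[] args) {
        // Hypothetical statements about a scholarly document and its author:
        Triple[] statements = {
            new Triple("doc:123", "dc:title", "Data model for scholarly documents"),
            new Triple("doc:123", "dc:creator", "person:kawa"),
            new Triple("doc:123", "cites", "doc:456"),
            new Triple("person:kawa", "name", "Adam Kawa")
        };
        for (Triple t : statements) {
            System.out.println(t.subject + " --" + t.predicate + "--> " + t.object);
        }
    }
}
```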
8. Hadoop as a solution for scalability/performance issues
   Apache Hadoop is the most commonly used open-source solution for storing and processing big data in a reliable, high-performance and cost-effective way. It offers scalable storage, parallel processing, and many subprojects and Hadoop-related projects:
   - HDFS: a distributed file system that provides high-throughput access to large data
   - MapReduce: a framework for distributed processing of large data sets (Java, and e.g. JavaScript, Python, Perl, Ruby via Streaming); see the sketch after this list
   - HBase: a scalable, distributed data store with a flexible schema, random read/write access and fast scans
   - Pig/Hive: higher-level abstractions on top of MapReduce (simple data manipulation languages)
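To give a flavor of the MapReduce programming model, here is a minimal sketch of a Java job that counts how many triples use each predicate; the input format (one tab-separated "subject predicate object" triple per line) and all names are illustrative assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts triples per predicate, assuming one tab-separated triple per line. */
public class PredicateCount {

    public static class PredicateMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text predicate = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length == 3) {
                predicate.set(fields[1]);   // emit (predicate, 1)
                context.write(predicate, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "predicate count");
        job.setJarByClass(PredicateCount.class);
        job.setMapperClass(PredicateMapper.class);
        job.setCombinerClass(SumReducer.class);  // same aggregation, applied map-side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```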
9. Apache Hadoop ecosystem tools as RDF triple stores
   - SHARD [3]: a Hadoop-backed RDF triple store; stores triples in flat files in HDFS; data cannot be modified randomly; less efficient for queries that require the inspection of only a small number of triples
   - PigSPARQL [6]: translates SPARQL queries to Pig Latin programs and runs them on a Hadoop cluster; stores RDF triples with the same predicate in separate flat files in HDFS
   - H2RDF [5]: an RDF store that combines MapReduce with HBase; stores triples in HBase using three flat-wide tables
   - Jena-HBase [4]: an HBase-backed RDF triple store; provides six different pluggable HBase storage layouts
10. HBase as a storage layer for RDF triples
   Storing RDF triples in Apache HBase has several advantages:
   - flexible data model: columns can be dynamically added and removed; multiple versions of data in a particular cell; data serialized to byte arrays
   - random read and write: more suitable for semi-structured RDF data than HDFS, where files cannot be modified randomly and usually a whole file must be read sequentially to find a subset of records
   - availability of many clients: interactive clients (native Java API, REST or Apache Thrift) and batch clients (MapReduce in Java, Pig in Pig Latin, Hive in HiveQL); a basic Java example follows below
   - automatically sorted records: quick lookups and partial scans; joins as fast (linear) merge-joins
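A minimal sketch of random write and read through the classic HBase Java client; the "triples" table and the "p" column family are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TripleStoreExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // A "triples" table with a "p" column family is assumed to exist.
        HTable table = new HTable(conf, "triples");

        // Random write: row key = subject, qualifier = predicate, value = object.
        Put put = new Put(Bytes.toBytes("doc:123"));
        put.add(Bytes.toBytes("p"), Bytes.toBytes("dc:title"),
                Bytes.toBytes("Data model for analysis of scholarly documents"));
        table.put(put);

        // Random read: fetch the predicates of a single subject.
        Result result = table.get(new Get(Bytes.toBytes("doc:123")));
        byte[] title = result.getValue(Bytes.toBytes("p"), Bytes.toBytes("dc:title"));
        System.out.println(Bytes.toString(title));

        table.close();
    }
}
```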
11. Exemplary HBase schema: flat-wide layout
   (In this layout, each subject gets one row whose columns hold its predicates; see the sketch below.)
   Advantages:
   - no prior knowledge about the data is required
   - colocation of all information about a resource within a single row
   - support for multi-valued properties
   - support for reified statements (statements about statements)
   Disadvantages:
   - unlimited number of columns
   - increase of storage space
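A sketch of writing a triple in the flat-wide layout; the table handle and the "p" column family are illustrative assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Flat-wide layout: one row per subject; each predicate is a column qualifier. */
public class FlatWideWriter {
    private final HTable table;  // assumed to have a "p" column family

    public FlatWideWriter(HTable table) { this.table = table; }

    public void putTriple(String subject, String predicate, String object)
            throws IOException {
        Put put = new Put(Bytes.toBytes(subject));
        // Multi-valued properties map to multiple cell versions (or distinct
        // qualifiers); reified statements can live under a statement-id row.
        put.add(Bytes.toBytes("p"), Bytes.toBytes(predicate), Bytes.toBytes(object));
        table.put(put);
    }
}
```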
12. Exemplary HBase schema: vertically partitioned layout [1]
   (In this layout, triples are partitioned by predicate, with rows sorted by subject; see the sketch below.)
   Advantages:
   - support for multi-valued properties
   - support for reified statements (statements about statements)
   - storage space savings compared to the previous layout
   - first-step (predicate-bound) pairwise joins as fast merge-joins
   Disadvantages:
   - increased number of joins
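One way to realize vertical partitioning in HBase is a separate table per predicate, keyed by subject; this is an interpretation of the layout, not necessarily the deck's exact schema, and all names are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Vertically partitioned layout: triples are split across per-predicate tables. */
public class VerticallyPartitionedWriter {
    private final Configuration conf;

    public VerticallyPartitionedWriter(Configuration conf) { this.conf = conf; }

    public void putTriple(String subject, String predicate, String object)
            throws IOException {
        // One table per predicate; table names forbid ':', so sanitize it.
        // Rows are sorted by subject, so a predicate-bound join of two such
        // tables can run as a linear merge-join.
        HTable table = new HTable(conf, "vp_" + predicate.replace(':', '_'));
        Put put = new Put(Bytes.toBytes(subject));
        put.add(Bytes.toBytes("o"), Bytes.toBytes("value"), Bytes.toBytes(object));
        table.put(put);
        table.close();  // a real writer would cache and reuse table handles
    }
}
```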
13. Exemplary HBase schema: hexastore layout [2]
   (In this layout, each triple is indexed under all six orderings of subject, predicate and object; see the sketch below.)
   Advantages:
   - support for multi-valued properties
   - support for reified statements (statements about statements)
   - first-step pairwise joins as fast merge-joins
   Disadvantages:
   - increased number of joins
   - increase of storage space
   - more complicated update operations
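A sketch of hexastore-style composite row keys: every triple is written six times, so a lookup bound on any one or two components becomes a sorted prefix scan; the key format and column names are illustrative assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Hexastore layout: each triple indexed under all six component orderings. */
public class HexastoreWriter {
    private final HTable table;  // assumed to have an "i" column family

    public HexastoreWriter(HTable table) { this.table = table; }

    public void putTriple(String s, String p, String o) throws IOException {
        // Six composite row keys; updates and deletes must touch all six,
        // which is why updates are more complicated in this layout.
        String[] keys = {
            "spo|" + s + "|" + p + "|" + o,  "sop|" + s + "|" + o + "|" + p,
            "pso|" + p + "|" + s + "|" + o,  "pos|" + p + "|" + o + "|" + s,
            "osp|" + o + "|" + s + "|" + p,  "ops|" + o + "|" + p + "|" + s
        };
        for (String key : keys) {
            Put put = new Put(Bytes.toBytes(key));
            put.add(Bytes.toBytes("i"), Bytes.toBytes("t"), new byte[0]);
            table.put(put);
        }
    }
}
```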
14. HBase schema: other layouts
   Some derivative and hybrid layouts exist that combine the advantages of the original layouts:
   - a combination of the vertically partitioned and the hexastore layouts [4]
   - a combination of the flat-wide and the vertically partitioned layouts [4]
15. Challenges
   - a large number of join operations: relatively expensive and practically unavoidable (at least for more complex queries), but specialized join techniques can be used, e.g. multi-join, merge-sort join, replicated join, skewed join
   - lack of native support for cross-row atomicity (e.g. in the form of transactions); single-row operations, however, are atomic, as sketched below
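HBase does guarantee atomicity within a single row, so conditional per-row updates are available even without cross-row transactions; a sketch using checkAndPut, where the table layout and column names are illustrative assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Single-row atomicity: checkAndPut applies a Put only if the current cell
 *  value matches the expected one; cross-row updates get no such guarantee. */
public class AtomicRowUpdate {
    public static boolean claimSurname(HTable table, String contributorRow,
                                       String newSurname) throws IOException {
        Put put = new Put(Bytes.toBytes(contributorRow));
        put.add(Bytes.toBytes("p"), Bytes.toBytes("surname"),
                Bytes.toBytes(newSurname));
        // Succeeds only if the "surname" cell is currently absent
        // (a null expected value means "no existing cell").
        return table.checkAndPut(Bytes.toBytes(contributorRow),
                Bytes.toBytes("p"), Bytes.toBytes("surname"), null, put);
    }
}
```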
16. Possible performance optimization techniques
   - property tables: properties often queried together are stored in the same record for quick access [8, 9]; a sketch follows below
   - materialized path expressions: precalculation and materialization of the most commonly used paths through an RDF graph in advance [1, 2]
   - graph-oriented partitioning scheme [7]: takes advantage of the spatial locality inherent in graph pattern matching; higher replication of data that lies on the border of any particular partition (however, problematic for a graph that is modified)
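A sketch of the property-table idea: frequently co-queried predicates of a document are written into one row, so a single Get answers the combined query without joins; the table, family and predicate names are illustrative assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Property table: co-queried predicates of a document stored in one row. */
public class DocumentPropertyTable {
    public static void putDocument(HTable table, String docId, String title,
                                   String year, String journal) throws IOException {
        byte[] family = Bytes.toBytes("p");
        Put put = new Put(Bytes.toBytes(docId));
        // Title, year and journal are often requested together, so one Get
        // on the row retrieves all three without any join.
        put.add(family, Bytes.toBytes("dc:title"), Bytes.toBytes(title));
        put.add(family, Bytes.toBytes("dc:date"), Bytes.toBytes(year));
        put.add(family, Bytes.toBytes("dc:publisher"), Bytes.toBytes(journal));
        table.put(put);
    }
}
```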
17. The ways of processing data from HBase
   Various tools are integrated with HBase and can read data from and write data to HBase tables:
   - Java MapReduce: possibility to reuse our legacy Java code in map and reduce methods; delivers better performance than Apache Pig; see the sketch after this list
   - Apache Pig: provides common data operations (e.g. filters, unions, joins, ordering) and nested types (e.g. tuples, bags, maps); supports multiple specialized join implementations; possibility to run MapReduce jobs directly from Pig Latin scripts; can be embedded in Python code
   - interactive clients (e.g. Java API, REST or Apache Thrift): interactive access to a relatively small subset of our data by sending API calls on demand, e.g. from a web-based client
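A minimal sketch of a Java MapReduce job that scans an HBase table via TableMapReduceUtil; the "triples" table and the map-only output are illustrative assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Map-only job that scans the (assumed) "triples" table and dumps row keys. */
public class HBaseScanJob {

    static class RowKeyMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            // Emit the row key (the subject in a flat-wide layout) with a count;
            // a reducer could aggregate these per subject or per predicate.
            context.write(new Text(row.get()), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase scan example");
        job.setJarByClass(HBaseScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch rows in batches for throughput
        scan.setCacheBlocks(false);  // recommended for MapReduce scans

        TableMapReduceUtil.initTableMapperJob(
                "triples", scan, RowKeyMapper.class,
                Text.class, IntWritable.class, job);
        job.setNumReduceTasks(0);    // map-only sketch
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```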
18. Case study: author name disambiguation algorithm
   The most complex algorithm that we have run over Apache HBase so far is the author name disambiguation algorithm.
19. Thanks!
   More information about CeON: http://ceon.pl/en/research
   © 2012 Adam Kawa. This document is available under the Creative Commons Attribution 3.0 Poland license. The license text is available at: http://creativecommons.org/licenses/by/3.0/pl/
20. References
   [1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In VLDB, pages 411–422, 2007.
   [2] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for Semantic Web Data Management. In VLDB, pages 1008–1019, 2008.
   [3] K. Rohloff and R. Schantz. High-Performance, Massively Scalable Distributed Systems Using the MapReduce Software Framework: The SHARD Triple-Store. In International Workshop on Programming Support Innovations for Emerging Distributed Applications, 2010.
   [4] V. Khadilkar, M. Kantarcioglu, P. Castagna, and B. Thuraisingham. Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. Technical report, 2012. http://www.utdallas.edu/~vvk072000/Research/Jena-HBase-Ext/tech-report.pdf
   [5] N. Papailiou, I. Konstantinou, D. Tsoumakos, and N. Koziris. H2RDF: Adaptive Query Processing on RDF Data in the Cloud. In Proceedings of the 21st International Conference on World Wide Web (WWW demo track), Lyon, France, 2012.
21. References (continued)
   [6] A. Schätzle, M. Przyjaciel-Zablocki, and G. Lausen. PigSPARQL: Mapping SPARQL to Pig Latin. In 3rd International Workshop on Semantic Web Information Management (SWIM 2011), in conjunction with SIGMOD 2011, Athens, Greece.
   [7] J. Huang, D. Abadi, and K. Ren. Scalable SPARQL Querying of Large RDF Graphs. In Proceedings of the VLDB Endowment, Volume 4 (VLDB 2011).
   [8] K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. In SWDB, pages 131–150.
   [9] K. Wilkinson. Jena Property Table Implementation. In SSWS, 2006.
