Web Semantics in the Clouds
 


Peter Mika, Yahoo! Research
Giovanni Tummarello, Digital Enterprise Research Institute

The Semantic Web department. Editor: Steffen Staab, University of Koblenz-Landau, staab@uni-koblenz.de

In the last two years, the amount of structured data made available on the Web in semantic formats has grown by several orders of magnitude. On one side, the Linked Data effort has made available online hundreds of millions of entity descriptions based on the Resource Description Framework (RDF) in data sets such as DBPedia, Uniprot, and Geonames. On the other hand, the Web 2.0 community has increasingly embraced the idea of data portability, and the first efforts have already produced billions of RDF-equivalent triples, either embedded inside HTML pages using microformats or exposed directly using eRDF (embedded RDF) and RDFa (RDF attributes).

Incentives for exposing such data are also finally becoming clearer. Yahoo!'s SearchMonkey, for example, makes Web sites containing structured data stand out from others by providing the most appropriate visualization for the end user in the search result page. It will not be long, we envision, before search engines also directly use this information for ranking and relevance purposes—returning, for example, qualitatively better results for queries that involve everyday entities such as events, locations, and people.

Even though we're still at the beginning of the data Web era, the amount of information already available is clearly much larger than what could be contained, for example, in any current-generation triple store (a database for storing and retrieving RDF metadata) typically running on single servers.

Although many applications will need to work with large amounts of metadata, one particular application would certainly not exist without the capability of accessing and processing arbitrary amounts of metadata: search engines that locate the data and services that other applications need. For this reason, Semantic Web search engines and large-scale services are now the first in harnessing grid computing's power when it comes to scaling far beyond the current generation of triple stores.

Cloud computing for the web of data

Not all computationally intense problems require similarly structured hardware configurations. Classic supercomputers, typically characterized by superior arithmetic performance per CPU and high-end CPU interconnection technologies, are proven tools that have scored great successes in physics, astronomy, chemistry, biology, and many other fields. In general, researchers find it hard to surpass such high-end machines when they're facing problems that tend to be hard to parallelize or when they need intense interprocess communication.

When this isn't the case, however, cluster-computing approaches typically exhibit far greater flexibility in resource utilization and much lower overall costs. In an extreme case, clouds can extend to the entire Internet: computations involving relatively small data blocks, with no coordination needs and no significant constraints on execution times, have been performed across the Internet—for example, using free desktop cycles as in the SETI@home or Folding@home projects.

The computations needed in Web data processing lie somewhere between these extremes. On the one hand, Web data is interlinked, and the analysis of its mesh has proven to be fundamental in gaining insights about its implicit nature. At the same time, however, the data is by nature distributed and can be said to be consistent with itself, if at all, only within the boundaries of a single Web site. Furthermore, a prominent characteristic is the sheer amount of such data, with a petabyte being a common order of magnitude.1
Google’s publications about its MapReduce tainly not exist without the capability of accessing and pro- framework2 and more recently the Yahoo!-initiated, open cessing arbitrary amounts of metadata: search engines that source implementation Hadoop (www.hadoop.apache.org) locate the data and services that other applications need. For are attracting increasing attention from developers and us- this reason, Semantic Web search engines and large-scale ers alike. services are now the first in harnessing grid computing’s The MapReduce style of computation works well with power when it comes to scaling far beyond the current gen- the possibilities offered by the emerging cloud-computing eration of triple stores. paradigm. This computation style provides generally use- ful abstractions so that developers can focus on the task at Cloud computing for the web of data hand (see the sidebar “More about These Technologies”). Not all computationally intense problems require simi- Executing computations “in the clouds” refers to the model larly structured hardware configurations. Classic super- in which an application requests computational resources computers, typically characterized by superior arithmetic from a service provider without needing to bother with the82 1541-1672/08/$25.00 © 2008 IEEE IEEE INTELLIGENT SYSTEMS Published by the IEEE Computer Society
Sidebar: More about These Technologies

Although distributed computing is almost as old as computer networks, the paradigm has been rapidly gaining popularity recently under the name cloud computing. This surge in interest is partly due to limitations of single-machine hardware architectures but is also the result of improvements in software abstractions that hide the complexity of underlying hardware architectures from the programmer. Three of these software layers—Hadoop, HBase, and Pig—all implement data structures (maps and relational tables, for instance) and processing pipelines that are familiar to most programmers. In cases where the developer can map the problem at hand to these solution spaces, the task of cluster programming becomes almost as simple as developing software for single machines (although testing might be more involved). All three are Java-based open source projects developed under the Apache organization. The descriptions we provide cover only the systems' core functionality.

Hadoop

Hadoop (http://hadoop.apache.org/core) implements a distributed file system (HDFS) and a MapReduce framework similar to the Google File System and Google's original implementation of MapReduce.1 HDFS provides a virtual disk space that is physically distributed across the machines, where the data is replicated to withstand the failure of single machines and speed up processing. MapReduce provides a simple processing pipeline consisting of two phases, a Map and a Reduce phase. The machines in the cluster work in parallel in both phases, running identical jobs but on different slices of the data. MapReduce builds on HDFS in that the input, output, and intermediate results are stored on the distributed file system.

The input and output of the Map phase is a bag of (key, value) pairs, in which the keys need not be unique. The data on disk can be in any structure as long as the developer can provide a function that loads it into key-value pairs. Keys and values can be arbitrarily complex Java objects as long as they implement the necessary interfaces. At the end of the Map phase, the system collects the pairs with the same key and sends them to separate machines. The Reduce-phase jobs then operate on a bag of pairs with the same key to produce some output. (The output most often also consists of key-value pairs to serve as input for some other processing.) So, the Reduce phase can begin only after all the Map tasks are done. However, this is the only synchronization point and the only interprocess communication, thus greatly simplifying implementation.
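To make the Map and Reduce phases described above concrete, here is a minimal sketch of a Hadoop job in Java that counts how often each predicate occurs in a collection of triples stored one per line in N-Triples form. It is not code from the article: the class names, the whitespace-based line splitting, and the input and output paths are illustrative assumptions, and the org.apache.hadoop.mapreduce API shown is the one used by later Hadoop releases.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PredicateCount {

      // Map phase: one N-Triples line in, a (predicate, 1) pair out.
      public static class PredicateMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text predicate = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          // Naive split: subject, predicate, rest of the line.
          String[] parts = line.toString().trim().split("\\s+", 3);
          if (parts.length == 3) {
            predicate.set(parts[1]);
            context.write(predicate, ONE);
          }
        }
      }

      // Reduce phase: all counts for one predicate arrive together; sum them.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text predicate, Iterable<IntWritable> counts,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) {
            sum += c.get();
          }
          context.write(predicate, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "predicate-count");
        job.setJarByClass(PredicateCount.class);
        job.setMapperClass(PredicateMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS directory of .nt files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The Map phase emits a (predicate, 1) pair per triple; Hadoop groups the pairs by key, and the Reduce phase sums them, which is exactly the two-phase pipeline with a single synchronization point described above.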
HBase

HBase (http://hadoop.apache.org/hbase) builds on Hadoop to provide a virtual database abstraction similar to the original BigTable system.2 An HBase table always has a column that serves as the key; rows can only be located by this key's value (that is, the table has a single index). Thus, HBase tables can also be considered key-value pairs, where the keys are unique and the value provides the content for the non-key columns. BigTable was originally developed for storing the results of crawling, and HBase preserves some of these characteristics, such as the ability to store different versions of cell contents, which can be distinguished by time stamp. HBase doesn't provide a join operation on these tables, only simple lookups by key. HBase tables are stored on the distributed file system in an indexed form to support this.
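As a hedged illustration of this key-value access pattern, the sketch below writes a crawled document under its URL as row key and reads it back. The table name, column family, qualifier, and example content are invented for the sketch, and the calls shown (HTable, Put, Get) belong to the classic HBase client API; newer releases expose the same operations through Connection and Table objects.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RawDocumentStore {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table "documents" with a single column family "raw".
        HTable table = new HTable(conf, "documents");
        try {
          byte[] family = Bytes.toBytes("raw");
          byte[] qualifier = Bytes.toBytes("content");
          byte[] rowKey = Bytes.toBytes("http://example.org/foaf.rdf");

          // Write: the row key is the only index, as described above.
          Put put = new Put(rowKey);
          put.add(family, qualifier, Bytes.toBytes("<rdf:RDF>...</rdf:RDF>"));
          table.put(put);

          // Read: a simple lookup by key; older versions of the cell remain
          // available and are distinguished by timestamp.
          Result result = table.get(new Get(rowKey));
          System.out.println(Bytes.toString(result.getValue(family, qualifier)));
        } finally {
          table.close();
        }
      }
    }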
Pig

Pig (http://incubator.apache.org/pig) also builds on Hadoop but goes a step further toward database-like functionality. A table in Pig is a bag of tuples, in which each field can hold a value or a bag of tuples (that is, nested tables are also possible). Tables in Pig aren't stored on disk in any special form but are loaded by a custom load function the programmer defines. (This has the disadvantage that the data isn't indexed by any key.) PigLatin, Pig's scripting language, is procedural—that is, writing a Pig script is more similar to writing a "query plan" than to writing a query in SQL. Nevertheless, most users find learning PigLatin easy because it provides all the familiar constructs of SQL such as filtering (projection), joins, grouping, sorting, and so on. PigLatin can also be extended using custom functions. Furthermore, it can also be embedded inside Java to provide additional control such as conditionals and loops. PigLatin scripts are translated into a sequence of MapReduce jobs.

Sidebar references

1. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. 6th Symp. Operating System Design and Implementation (OSDI 04), Usenix Assoc., 2004.
2. F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data," Proc. 7th Symp. Operating System Design and Implementation (OSDI 06), Usenix Assoc., 2006, pp. 205–218.

Batch-processing RDF using Yahoo! Pig

Yahoo! is building on grid computing using Hadoop to enable the analysis, transformation, and querying of large amounts of RDF data in a batch-processing mode using clusters of hundreds of machines, without apparent bottlenecks in scalability. The Yahoo! crawler, affectionately named Slurp, began indexing microformat content in the spring of this year, and the company recently added eRDF and RDFa to its supported formats. Yahoo! has also innovated in the Semantic Web area by allowing site owners to expose metadata using the DataRSS format, an Atom-based format for delivering RDF data. The Yahoo! SearchMonkey application platform will likely produce a further explosion in the amount of data handled. SearchMonkey developers can create so-called custom data services to extract metadata from existing Web sites or turn APIs into sources of metadata. All the metadata collected by Yahoo! is stored in an RDF-compatible format, so processing it requires the ability to query and transform large amounts of RDF data.

Yahoo! aims to develop data solutions that address wide classes of data management problems and that can easily be adapted to new problems, such as the one posed by SearchMonkey.4 One of these tools, Yahoo! Pig, simplifies the processing of large data sets on computer clusters by applying concepts from parallel databases.5 It was originally developed inside Yahoo! Research but has been recently made available as open source under the Apache 2.0 license. Pig natively provides support for data transformations such as projections, grouping, sorting, joining, and compositions. The expressivity of Pig's transformation language is roughly equivalent to standard relational algebra (which also forms the basis of SQL), with the added benefit of extensibility through custom functions written in Java. Pig programmers develop custom code for loading and saving data in other formats into Pig's data model, which again builds on the relational model (bags of tuples) with additional features such as maps and nested bags of tuples.

Scripts written in PigLatin, Pig's native language, are executed on the cluster using the Hadoop framework or Galago (www.galagosearch.org), a tuple-processing engine. In contrast to HBase, Pig can't actually be called a database: processing takes place by iterating through the whole data set (the data isn't indexed and can't be updated), and the results of the computation are saved. HBase, however, doesn't offer a query language, only the retrieval of tuples by their index.
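To give a feel for what such a PigLatin transformation looks like when embedded in Java, the sketch below loads space-separated triples, groups them by predicate, and counts them, the same job as the earlier MapReduce sketch but expressed declaratively. The input path, delimiter, and schema are assumptions made for this example; PigServer, registerQuery, and store are part of Pig's public Java API.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigPredicateCount {
      public static void main(String[] args) throws Exception {
        // MAPREDUCE mode runs the script as Hadoop jobs on the cluster;
        // ExecType.LOCAL would run it on a single machine for testing.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Hypothetical input: one triple per line, fields separated by
        // single spaces (a naive assumption for literals with spaces).
        pig.registerQuery(
            "triples = LOAD 'input/triples.nt' USING PigStorage(' ') "
            + "AS (s:chararray, p:chararray, o:chararray);");
        pig.registerQuery("grouped = GROUP triples BY p;");
        pig.registerQuery(
            "counts = FOREACH grouped GENERATE group AS predicate, "
            + "COUNT(triples) AS occurrences;");

        // Triggers the actual sequence of MapReduce jobs.
        pig.store("counts", "output/predicate-counts");
      }
    }

Pig compiles the three statements into a sequence of MapReduce jobs automatically, which is the property the authors build on below when mapping Sparql onto PigLatin.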
We observed that Pig's data model and transformation language are similar to the relational representations of RDF and the Sparql query language, respectively, so we recently extended Pig to perform RDF querying and transformations. As part of this work, we implemented load and save functions to convert RDF to Pig's data model, created a mapping between Sparql and PigLatin, and proved that the mapping is complete.

We implemented these results as a new back end for the popular Sesame triple store.6 Sesame comes with several components for storing and retrieving tuples (including an in-memory implementation, a native disk-based implementation, and a relational-database-management-system-based implementation) but also lets us plug in additional back ends with minimal effort. As a result, Sesame applications can switch transparently to a Pig-based RDF store when scalability requires it, without any changes to the application code.
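The practical benefit of plugging in at the Sesame level is that application code is written against Sesame's generic Repository interface. The hedged sketch below shows such application code running against Sesame's stock in-memory store; the article doesn't name the Pig-based back end's classes, but swapping the MemoryStore for that Sail implementation would, by design, leave the query code untouched. The FOAF query is an invented example.

    import org.openrdf.query.QueryLanguage;
    import org.openrdf.query.TupleQueryResult;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.sail.memory.MemoryStore;

    public class SesameClient {
      public static void main(String[] args) throws Exception {
        // The application sees only the Repository interface; the Sail
        // passed in here decides whether queries run in memory, on disk,
        // or (with a Pig-backed Sail) as batch jobs on a cluster.
        Repository repo = new SailRepository(new MemoryStore());
        repo.initialize();
        RepositoryConnection con = repo.getConnection();
        try {
          String sparql =
              "PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
              + "SELECT ?person ?name WHERE { ?person foaf:name ?name }";
          TupleQueryResult result =
              con.prepareTupleQuery(QueryLanguage.SPARQL, sparql).evaluate();
          while (result.hasNext()) {
            System.out.println(result.next().getValue("name"));
          }
          result.close();
        } finally {
          con.close();
        }
      }
    }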
We evaluated our system by performing the common Lehigh University Benchmark (LUBM). We experimented with the first two queries of the LUBM set, varying the number of nodes used for computation and the size of the data.

Figure 1 shows Pig's performance using the same query (the first one in the test set) on different sizes of test data and with varying numbers of nodes. In Figure 2, we show how performance changes as we move from the first, simple query to the second, more complex query. The latter requires five joins instead of one.

Figure 1. Pig's performance on Query 1 of the Lehigh University Benchmark (LUBM) using varying numbers of nodes. Different curves represent different sizes of data—that is, the LUBM 50, LUBM 100, and LUBM 1,000 data sets.

Figure 2. Pig's performance on two queries, both taken from the LUBM benchmark. Q1 is simpler and Q2 is more complex, using varying numbers of nodes.

Both figures show that asymptotically the execution time is more or less independent of the data set's size and is around 100 seconds. In other words, no matter how large a data set is, given sufficient resources this query can be executed in about 100 seconds. On the other hand, we consider this a lower bound, because it's associated with the fixed costs of allocating nodes and distributing the job. This confirms that—as expected—our solution isn't applicable to online, interactive tasks.

The results also show that the improvement from adding additional nodes depends on whether the execution is balanced. A balanced execution is one where each node processes an equal number of blocks—that is, when b/(2m) = k, where k ∈ N+, b is the number of blocks, and m is the number of machines. Between points that result in balanced executions, additional nodes won't improve performance, because performance is determined by the nodes that need to process the larger number of blocks. However, we do observe the expected linear scale-up for balanced execution without any notable communication overhead, even when scaling up to hundreds of nodes. Furthermore, this holds true for the second, more complex query: execution time increases proportionally with the query, but the general scaling behavior stays the same.
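To illustrate the balance condition with made-up numbers: with b = 128 blocks, a run on m = 64 machines satisfies b/(2m) = 1, so every node processes exactly two blocks. Growing the cluster from 48 to 60 machines, by contrast, buys nothing, since 128/(2 × 48) and 128/(2 × 60) are not positive integers and the slowest nodes still have three blocks to process; only at 64 machines does the per-node maximum drop to two and the completion time fall.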
Although we don't report the results in this article, we also experimented with RDF reasoning using a forward-chaining algorithm. Reasoning in this scenario requires executing queries iteratively and adding the results to the data set until no more new triples are produced. Also, this method is easy to extend with rule-based implementations of OWL subsets.
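The following is a minimal, self-contained sketch of that fixpoint idea, not the authors' Pig implementation: one rule set is applied repeatedly and the inferred triples are added to the data set until an iteration produces nothing new. The only rule included is the RDFS subproperty rule, so a foaf:name statement also yields an rdfs:label statement; in the Pig setting each iteration would be a query over the cluster rather than an in-memory loop.

    import java.util.HashSet;
    import java.util.Set;

    public class ForwardChaining {

      // A triple as three opaque strings; real systems intern IRIs.
      record Triple(String s, String p, String o) {}

      static final String SUB_PROPERTY_OF =
          "http://www.w3.org/2000/01/rdf-schema#subPropertyOf";

      // rdfs7: (p1 subPropertyOf p2) and (x p1 y) implies (x p2 y).
      static Set<Triple> applyRules(Set<Triple> data) {
        Set<Triple> inferred = new HashSet<>();
        for (Triple schema : data) {
          if (!schema.p().equals(SUB_PROPERTY_OF)) continue;
          for (Triple fact : data) {
            if (fact.p().equals(schema.s())) {
              inferred.add(new Triple(fact.s(), schema.o(), fact.o()));
            }
          }
        }
        return inferred;
      }

      // Iterate until no new triples are produced (the fixpoint).
      static Set<Triple> closure(Set<Triple> data) {
        Set<Triple> all = new HashSet<>(data);
        while (true) {
          Set<Triple> fresh = applyRules(all);
          if (!all.addAll(fresh)) {
            return all;
          }
        }
      }

      public static void main(String[] args) {
        Set<Triple> data = new HashSet<>();
        data.add(new Triple("http://xmlns.com/foaf/0.1/name",
            SUB_PROPERTY_OF, "http://www.w3.org/2000/01/rdf-schema#label"));
        data.add(new Triple("http://example.org/giovanni",
            "http://xmlns.com/foaf/0.1/name", "\"Giovanni\""));
        closure(data).forEach(System.out::println);
      }
    }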
As for the limitations of the approach, Pig processing provides a solution only to offline batch-processing tasks because of the overhead of distributing the job to processing nodes and copying data if necessary. Also, because no index is provided, updates require full parsing. (On the other hand, adding data is as simple as copying it to a directory of the distributed file system.) Despite these drawbacks, a MapReduce-based infrastructure is most likely to offer the best resource utilization in an analytical or research scenario by relying only on commodity hardware and offering a general-purpose computing platform. In our research environment, the same cluster of machines that's used to analyze ad clicks or query logs is used to query and reason with metadata, because both tasks can be fitted to the same paradigm. The system allocates computing nodes dynamically and releases them back to the pool as soon as the job is finished.

Because of their versatility, MapReduce clusters are used in production for large-scale data processing at both Google2 and Yahoo! (http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html). However, given that the software is free and requires no special hardware, this solution is also open to universities that must perform experiments with large amounts of metadata.

Sindice: Large-scale processing of structured Web data

Developed by the Digital Enterprise Research Institute's Data Intensive Infrastructures group (http://di2.deri.ie), the Sindice project deals with building scalable APIs to locate and use RDF and microformat data as found on the Web. By using the Sindice API, for example, it's possible to use keywords and semantic-pattern queries to search for people, places, events, and connections on the basis of semantically structured documents found on the Web. This includes, for example, FOAF (friend-of-a-friend) files, HTML pages using the hcard microformat, XML Social Network (XFN) information, and so on.

Technically, Sindice is designed to provide these services, fulfilling three main nonfunctional requirements: scalability, runtime performance, and ability to cope with the many changes in standards and usage practices on the Web of data. Figure 3 shows a sketch of Sindice's internal architecture.

Figure 3. The Sindice architecture. In the Sindice indexing pipeline, documents are first discovered and crawled from the Web. The pipeline then extracts data from the documents and performs reasoning and entity consolidation before creating the index that's used for serving clients.

In terms of scalability, the goal is to handle massive amounts of data on the Web, with predicted scalability needs in the range of trillions of triples. We achieve this by using information retrieval techniques to answer both textual and semantic queries over large collections of structured documents. This approach trades limited query complexity (for example, only allowing simple pattern queries and limited join capability) for high scalability, as we see in traditional Web search engines. Although answering queries doesn't require cloud-computing techniques, these are fundamental for Sindice to reach its scalability goals during the entire preprocessing phase and in particular to perform harvesting, storage, and data transformation.

At the harvesting level, SindiceBot (a Web crawler that collects Web pages and RDF documents for the Sindice search engine) employs Hadoop/Nutch (http://lucene.apache.org/nutch) to distribute the crawling jobs across multiple machines. The main crawling strategy is not unlike what's commonly used in Web harvesting, but specific tasks differ and require data-intensive operations. One such case is the efficient handling of large data sets published using the Linked Data model (a set of best practices for publishing and deploying instance and class data using the RDF data model). DBPedia.org, for example, exposes several million virtual RDF files derived on demand by querying the underlying data set.

Crawling one such site requires considerable time, poses limits to the frequency of recrawling and therefore of information updates by the search engine, and imposes a nonnegligible computational burden on the remote site. This is similar to the problem of Deep Web crawling, which the Sitemap protocol (www.sitemaps.org) is currently addressing. Similarly, an extension to the Sitemap protocol (http://sw.deri.org/2007/07/sitemapextension) allows data sets to be described in a way that the crawler can download the data as a dump and avoid retrieving each individual URI (uniform resource identifier) referenced in the data. Processing such dumps is data intensive, requiring RDF processing at the triple level to create entity representations, which are then individually indexed. Similar to what we described earlier about Pig, the Sindice crawler performs this task efficiently across the cluster using a MapReduce implementation.7
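As a hedged sketch of what triple-level processing into entity representations can look like in MapReduce terms (the article doesn't show Sindice's actual code), the job below keys every line of an N-Triples dump by its subject and, in the Reduce phase, gathers each subject's triples into a single entity description ready for indexing. Class names and paths are invented; the driver wiring mirrors the earlier Hadoop sketch.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EntityDescriptions {

      // Map: key each N-Triples line by its subject URI.
      public static class SubjectMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String[] parts = line.toString().trim().split("\\s+", 2);
          if (parts.length == 2) {
            context.write(new Text(parts[0]), line);
          }
        }
      }

      // Reduce: all triples of one subject arrive together; emit one
      // entity representation that the indexer can treat as a document.
      public static class EntityReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text subject, Iterable<Text> triples, Context context)
            throws IOException, InterruptedException {
          StringBuilder description = new StringBuilder();
          for (Text triple : triples) {
            description.append(triple.toString()).append('\n');
          }
          context.write(subject, new Text(description.toString()));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "entity-descriptions");
        job.setJarByClass(EntityDescriptions.class);
        job.setMapperClass(SubjectMapper.class);
        job.setReducerClass(EntityReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // dump of .nt files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }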
After collection, data is stored in its raw form, mostly HTML or RDF, in the HBase distributed column-based store (see the sidebar "More about These Technologies"). For complex data analysis, HBase constitutes the distributed storage medium with possibilities for structured queries—albeit limited—as well as input for any analytics implemented with MapReduce. To support the runtime requirements of data analysis tasks, the Sindice index must be precomputed and fully expanded. This is done in three steps.

First, the Sindice indexing pipeline preprocesses raw data from HBase. It analyzes the HTML using the Hadoop pipeline to extract RDFa, microformats, and many other significant elements and summary information. These include statistics but also extra bits of information such as the page's title, the presence of RSS or other machine-processible elements, and context elements such as keywords found in the text. At this point, the document's semantics, regardless of the original format, is represented in RDF and proceeds in the pipeline for reasoning and consolidation.

Second, the Sindice indexing pipeline applies reasoning to fact sets. Reasoning is important when indexing native RDF documents, because it makes information explicit and therefore directly available for indexing purposes. As an example, in the case of social-networking data described using the FOAF ontology, it wouldn't be possible to answer a query for "entities labeled 'giovanni'" unless you could infer that foaf:name is a subproperty of rdfs:label. Microformats, on the other hand, require even more sophisticated handling—for example, to extract the named entities from a description text and turn them into triples in the RDF representation. Efficient up-front reasoning requires that MapReduce jobs employ some form of state, which is unusual: MapReduce jobs must be functional by definition, which stands in contrast with the need to reuse previously obtained reasoning results (for instance, reasoning on TBox statements of reused ontologies should be performed just once). Our solution is based on the use of several "reasoning servers" shared across MapReduce jobs, which cache reasoning results in large main memories.
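One way to picture such a shared reasoning server, purely as an assumption since the article doesn't describe its interface, is as a memo table from an ontology's URI to its precomputed schema closure, held in main memory and consulted by map tasks instead of repeating the TBox reasoning for every document.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical cache living inside a long-running "reasoning server"
    // process that many MapReduce tasks talk to (for example over RPC).
    public class TBoxClosureCache {

      public interface SchemaReasoner {
        // Computes the deductive closure of one ontology's schema triples.
        Set<String> computeClosure(String ontologyUri);
      }

      private final ConcurrentMap<String, Set<String>> cache = new ConcurrentHashMap<>();
      private final SchemaReasoner reasoner;

      public TBoxClosureCache(SchemaReasoner reasoner) {
        this.reasoner = reasoner;
      }

      // TBox reasoning for a reused ontology (FOAF, say) runs only once;
      // every later request is an in-memory lookup.
      public Set<String> closureFor(String ontologyUri) {
        return cache.computeIfAbsent(ontologyUri, reasoner::computeClosure);
      }
    }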
Third, entity consolidation is required to establish appropriate cross-references between the data and its index. Let's assume that the engine has indexed documents containing a given term—for example, URI1. If the engine learns that URI1 owl:sameAs URI2, it will need to reindex all the originally indexed documents to add the equivalence of URI2 in the document term index. Similarly, the indexer often needs to modify the rules for data transformation—for instance, from microformats to RDF—to increase the quality of the search results or to address new usage patterns or new standards. Once this is done, the engine must reindex most of the data starting back from the HTML form.
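A common way to track such owl:sameAs equivalences, shown here as an illustrative sketch rather than Sindice's code, is a union-find structure: each new sameAs statement merges two equivalence classes, and the indexer can ask whether a statement introduced a new alias and therefore triggers reindexing of the affected documents.

    import java.util.HashMap;
    import java.util.Map;

    // Union-find over URIs: two URIs share a representative exactly when a
    // chain of owl:sameAs statements connects them.
    public class SameAsIndex {

      private final Map<String, String> parent = new HashMap<>();

      private String find(String uri) {
        parent.putIfAbsent(uri, uri);
        String p = parent.get(uri);
        if (!p.equals(uri)) {
          p = find(p);              // path compression
          parent.put(uri, p);
        }
        return p;
      }

      // Returns true if the statement merged two previously separate
      // classes, i.e., documents mentioning either URI need reindexing.
      public boolean addSameAs(String uri1, String uri2) {
        String r1 = find(uri1);
        String r2 = find(uri2);
        if (r1.equals(r2)) {
          return false;
        }
        parent.put(r1, r2);
        return true;
      }

      public boolean sameEntity(String uri1, String uri2) {
        return find(uri1).equals(find(uri2));
      }

      public static void main(String[] args) {
        SameAsIndex index = new SameAsIndex();
        // Hypothetical URIs standing in for URI1 and URI2 from the text.
        boolean changed = index.addSameAs("http://example.org/URI1",
                                          "http://example.org/URI2");
        System.out.println("reindex needed: " + changed);
        System.out.println(index.sameEntity("http://example.org/URI1",
                                            "http://example.org/URI2"));
      }
    }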
Once this c ­ omputing techniques, however, requires ogy (EDBT 08), ACM Press, 2008, pp.is done, the engine must reindex most of the a special kind of expertise. To aid in this 652–655.data starting back from the HTML form. process, cloud computing must have a sta- 5. C. Olston et al., “PigLatin: a Not-So- At the output of the indexing pipeline, ble place in academic curricula. Although F ­ oreign Language for Data Processing,”the RDF documents are now consolidated several universities are starting to provide Proc. 2008 ACM Sigmod Int’l Conf. Man- agement of Data (Sigmod 08), ACM Press,and contain explicit semantic statements courses on these paradigms, a notable lag 2008, pp. 1099–1110.together with consolidation information. exists between academia and the research 6. J. Broekstra, A. Kampman, and F. vanFinally, these are both sent to the index and centers of large commercial Internet play- Harmelen, “Sesame: An Architectureadded to HBase for later retrieval and fur- ers. Europe, it seems, is also lagging a bit. for Storing and Querying RDF and RDFther offline processing. We can only wish that this theme becomes Schema,” Proc. 1st Int’l Semantic Web Conf. (ISWC 02), LNCS 2342, Springer, Having illustrated the data-intensive pro- increasingly recognized as a strategic fo- 2002, pp. 54–68.cesses in a semantic-search engine, the need cus for achieving competitiveness in the 7. R. Cyganiak et al., “Exposing Large Datafor distributing storage and effectively shar- Web market. Sets with Semantic Sitemaps,” Proc. 5thing the computational load across the clus- Finally, with respect to the Semantic European Semantic Web Conf., LNCSter is evident. In Sindice, we’ve found that Web research community, we are very in- 5021, Springer, 2008, pp. 690–704.Hadoop and the associated technology stack terested in continuing to develop Semanticeffectively address data-intensive tasks, Web algorithms cast into the MapReducewhether for data management or analysis, framework or its higher-level abstractionsincluding those specific to processing struc- such as Pig and HBase. We believe we cantured data. In our experience, the indexing successfully transform some of the research Peter Mika is a researcher at Yahoo! Re- search. Contact him at pmika@yahoo-inc.pipeline’s processing throughput can grow problems we’ve been facing into the well- com.almost linearly with the number of serv- understood MapReduce paradigm anders. In the current setup, Sindice’s cluster of then apply solutions based on open source Giovanni Tummarello is a research fel-eight machines can process approximately implementations and commodity hard- low, head of the Data Intensive Infrastruc- tures research unit, and project leader for150 documents per second. ware. We call on the research community the Sindice Data Web Search Engine proj- to explore the entire range of Semantic Web ect (http://sindice.com) at the Digital En- algorithms that could be successfully trans- terprise Research Institute. Contact him atCloud-computing techniques, already the formed into this increasingly popular solu- giovanni.tummarello@deri.org. tion space. Many of the Semantic Web’sSeptember/October 2008 www.computer.org/intelligent 87