In this paper we discuss PigSPARQL, a competitive yet easy-to-use SPARQL query processing system on MapReduce that supports ad-hoc querying of large RDF graphs out of the box. Instead of a direct mapping, PigSPARQL uses the query language of Pig, a data analysis platform on top of Hadoop MapReduce, as an intermediate layer between SPARQL and MapReduce. This additional level of abstraction makes our approach independent of the actual Hadoop version and thus ensures compatibility with future changes of the Hadoop framework, as these are absorbed by the underlying Pig layer. We revisit PigSPARQL and demonstrate the performance improvement gained by simply switching the underlying version of Pig from 0.5.0 to 0.11.0 without any changes to PigSPARQL itself. Because of this sustainability, PigSPARQL is an attractive long-term baseline for comparing various MapReduce-based SPARQL implementations, which is also underpinned by its competitiveness with existing systems, e.g., HadoopRDF.
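To illustrate the flavor of this translation (a minimal sketch, not taken from the paper; the file paths, alias names and space-delimited N-Triples loading are assumptions), RDF triples stored on HDFS can be loaded as a three-column relation in Pig Latin, and a single SPARQL triple pattern with a bound predicate then corresponds to a filter on the predicate column:

-- SPARQL triple pattern:  ?person foaf:age ?age .
triples = LOAD 'hdfs:///data/rdf.nt' USING PigStorage(' ')
          AS (s:chararray, p:chararray, o:chararray);
-- The bound predicate becomes a filter; ?person and ?age map to columns s and o.
agePattern = FILTER triples BY p == '<http://xmlns.com/foaf/0.1/age>';
STORE agePattern INTO 'hdfs:///out/age_pattern';

PigSPARQL generates such scripts automatically from the SPARQL query, and Pig in turn compiles them into a sequence of MapReduce jobs.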
PigSPARQL: A SPARQL Query Processing Baseline for Big Data
1. PigSPARQL: A SPARQL Query Processing
Baseline for Big Data
12th International Semantic Web Conference (ISWC)
October 2013, Sydney, Australia
Alexander Schätzle
Martin Przyjaciel-Zablocki
Thomas Hornung
Georg Lausen
University of Freiburg
Databases & Information Systems
2. The next evolution of the World Wide Web (Web 3.0)?
◦ In fact, the amount of Semantic Data increases exponentially
◦ For example, consider the evolution of the Linked Open Data Cloud (LOD)
PigSPARQL: A SPARQL Query Processing Baseline for Big Data 2
Semantic Web
2007 – 12 datasets
taken from [http://lod-cloud.net/]
3. The next evolution of the World Wide Web (Web 3.0)?
◦ In fact, the amount of Semantic Data increases exponentially
◦ For example, consider the evolution of the Linked Open Data Cloud (LOD)
PigSPARQL: A SPARQL Query Processing Baseline for Big Data 3
Semantic Web
2007 – 12 datasets 2009 – 89 datasets
taken from [http://lod-cloud.net/]
4. The next evolution of the World Wide Web (Web 3.0)?
◦ In fact, the amount of Semantic Data increases exponentially
◦ For example, consider the evolution of the Linked Open Data Cloud (LOD)
PigSPARQL: A SPARQL Query Processing Baseline for Big Data 4
Semantic Web
2007 – 12 datasets 2009 – 89 datasets 2011 – 295 datasets
~ 31 billion RDF triples
taken from [http://lod-cloud.net/]
5. The next evolution of the World Wide Web (Web 3.0)?
◦ In fact, the amount of Semantic Data increases exponentially
◦ For example, consider the evolution of the Linked Open Data Cloud (LOD)
PigSPARQL: A SPARQL Query Processing Baseline for Big Data 5
Semantic Web
2007 – 12 datasets 2009 – 89 datasets 2011 – 295 datasets
~ 31 billion RDF triples
taken from [http://lod-cloud.net/]
How to answer
SPARQL queries on
large RDF datasets?
6. Has become a de facto standard in the area of Big Data
Hadoop is the most prominent MapReduce framework
PigSPARQL: A SPARQL Query Processing Baseline for Big Data 6
MapReduce
Hadoop Cluster at Yahoo!
7. PigSPARQL: A SPARQL Query Processing Baseline for Big Data 7
But!?...
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;
public class MRExample {
public static class LoadPages extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String key = line.substring(0, firstComma);
String value = line.substring(firstComma + 1);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("1" + value);
oc.collect(outKey, outVal);
}
}
public static class LoadAndFilterUsers extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String value = line.substring(firstComma + 1);
int age = Integer.parseInt(value);
if (age < 18 || age > 25) return;
String key = line.substring(0, firstComma);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("2" + value);
oc.collect(outKey, outVal);
}
}
public static class Join extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key,
Iterator<Text> iter,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// For each value, figure out which file it's from and
// store it accordingly.
List<String> first = new ArrayList<String>();
List<String> second = new ArrayList<String>();
while (iter.hasNext()) {
Text t = iter.next();
String value = t.toString();
if (value.charAt(0) == '1')
first.add(value.substring(1));
else second.add(value.substring(1));
reporter.setStatus("OK");
}
// Do the cross product and collect the values
for (String s1 : first) {
for (String s2 : second) {
String outval = key + "," + s1 + "," + s2;
oc.collect(null, new Text(outval));
reporter.setStatus("OK");
}
}
}
}
public static class LoadJoined extends MapReduceBase
implements Mapper<Text, Text, Text, LongWritable> {
public void map(
Text k,
Text val,
OutputCollector<Text, LongWritable> oc,
Reporter reporter) throws IOException {
// Find the url
String line = val.toString();
int firstComma = line.indexOf(',');
int secondComma = line.indexOf(',', firstComma + 1);
String key = line.substring(firstComma + 1, secondComma);
// drop the rest of the record, I don't need it anymore,
// just pass a 1 for the combiner/reducer to sum instead.
Text outKey = new Text(key);
oc.collect(outKey, new LongWritable(1L));
}
}
public static class ReduceUrls extends MapReduceBase
implements Reducer<Text, LongWritable, WritableComparable,
Writable> {
public void reduce(
Text key,
Iterator<LongWritable> iter,
OutputCollector<WritableComparable, Writable> oc,
Reporter reporter) throws IOException {
// Add up all the values we see
long sum = 0;
while (iter.hasNext()) {
sum += iter.next().get();
reporter.setStatus("OK");
}
oc.collect(key, new LongWritable(sum));
}
}
public static class LoadClicks extends MapReduceBase
implements Mapper<WritableComparable, Writable, LongWritable,
Text> {
public void map(
WritableComparable key,
Writable val,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
oc.collect((LongWritable)val, (Text)key);
}
}
public static class LimitClicks extends MapReduceBase
implements Reducer<LongWritable, Text, LongWritable, Text> {
int count = 0;
public void reduce(
LongWritable key,
Iterator<Text> iter,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
// Only output the first 100 records
while (count < 100 && iter.hasNext()) {
oc.collect(key, iter.next());
count++;
}
}
}
public static void main(String[] args) throws IOException {
JobConf lp = new JobConf(MRExample.class);
lp.setJobName("Load Pages");
lp.setInputFormat(TextInputFormat.class);
lp.setOutputKeyClass(Text.class);
lp.setOutputValueClass(Text.class);
lp.setMapperClass(LoadPages.class);
FileInputFormat.addInputPath(lp, new
Path("/user/gates/pages"));
FileOutputFormat.setOutputPath(lp,
new Path("/user/gates/tmp/indexed_pages"));
lp.setNumReduceTasks(0);
Job loadPages = new Job(lp);
JobConf lfu = new JobConf(MRExample.class);
lfu.setJobName("Load and Filter Users");
lfu.setInputFormat(TextInputFormat.class);
lfu.setOutputKeyClass(Text.class);
lfu.setOutputValueClass(Text.class);
lfu.setMapperClass(LoadAndFilterUsers.class);
FileInputFormat.addInputPath(lfu, new
Path("/user/gates/users"));
FileOutputFormat.setOutputPath(lfu,
new Path("/user/gates/tmp/filtered_users"));
lfu.setNumReduceTasks(0);
Job loadUsers = new Job(lfu);
JobConf join = new JobConf(MRExample.class);
join.setJobName("Join Users and Pages");
join.setInputFormat(KeyValueTextInputFormat.class);
join.setOutputKeyClass(Text.class);
join.setOutputValueClass(Text.class);
join.setMapperClass(IdentityMapper.class);
join.setReducerClass(Join.class);
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/indexed_pages"));
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/filtered_users"));
FileOutputFormat.setOutputPath(join, new
Path("/user/gates/tmp/joined"));
join.setNumReduceTasks(50);
Job joinJob = new Job(join);
joinJob.addDependingJob(loadPages);
joinJob.addDependingJob(loadUsers);
JobConf group = new JobConf(MRExample.class);
group.setJobName("Group URLs");
group.setInputFormat(KeyValueTextInputFormat.class);
group.setOutputKeyClass(Text.class);
group.setOutputValueClass(LongWritable.class);
group.setOutputFormat(SequenceFileOutputFormat.class);
group.setMapperClass(LoadJoined.class);
group.setCombinerClass(ReduceUrls.class);
group.setReducerClass(ReduceUrls.class);
FileInputFormat.addInputPath(group, new
Path("/user/gates/tmp/joined"));
FileOutputFormat.setOutputPath(group, new
Path("/user/gates/tmp/grouped"));
group.setNumReduceTasks(50);
Job groupJob = new Job(group);
groupJob.addDependingJob(joinJob);
JobConf top100 = new JobConf(MRExample.class);
top100.setJobName("Top 100 sites");
top100.setInputFormat(SequenceFileInputFormat.class);
top100.setOutputKeyClass(LongWritable.class);
top100.setOutputValueClass(Text.class);
top100.setOutputFormat(SequenceFileOutputFormat.class);
top100.setMapperClass(LoadClicks.class);
top100.setCombinerClass(LimitClicks.class);
top100.setReducerClass(LimitClicks.class);
FileInputFormat.addInputPath(top100, new
Path("/user/gates/tmp/grouped"));
FileOutputFormat.setOutputPath(top100, new
Path("/user/gates/top100sitesforusers18to25"));
top100.setNumReduceTasks(1);
Job limit = new Job(top100);
limit.addDependingJob(groupJob);
JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
jc.addJob(loadPages);
8. PigSPARQL: A SPARQL Query Processing Baseline for Big Data 8
But!?...
Hadoop MapReduce is…
fairly low-level
code-intensive
time-consuming
error-prone
You should be a Java Developer!
9. 1/20 lines of code
1/16 development time
You don't have to reinvent the wheel
Same performance as native MapReduce
PigSPARQL: A SPARQL Query Processing Baseline for Big Data 9
The same in Pig (Latin)
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
10. PigSPARQL: A SPARQL Query Processing Baseline for Big Data 10
But!?…
SPARQL is declarative
Pig Latin is imperative
SPARQL is more natural for RDF than Pig Latin
11. We can express every SPARQL operator by an equivalent sequence of Pig Latin expressions
That is exactly what PigSPARQL does!
PigSPARQL: A SPARQL Query Processing Baseline for Big Data 11
The good News
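For example (a sketch under the assumption of a single three-column triple relation; the prefixes, paths and alias names are illustrative, and PigSPARQL's actual generated scripts apply further rewriting), a basic graph pattern of two triple patterns sharing the variable ?person maps to one filter per pattern plus a join on the shared variable:

-- SPARQL BGP:  ?person foaf:name ?name .  ?person foaf:age ?age .
triples = LOAD 'hdfs:///data/rdf.nt' USING PigStorage(' ')
          AS (s:chararray, p:chararray, o:chararray);
namesTP = FILTER triples BY p == '<http://xmlns.com/foaf/0.1/name>';
names = FOREACH namesTP GENERATE s AS person, o AS name;
agesTP = FILTER triples BY p == '<http://xmlns.com/foaf/0.1/age>';
ages = FOREACH agesTP GENERATE s AS person, o AS age;
-- The shared variable ?person becomes the join key (evaluated as a MapReduce join by Pig).
bgp = JOIN names BY person, ages BY person;
result = FOREACH bgp GENERATE names::person AS person, names::name AS name, ages::age AS age;
STORE result INTO 'hdfs:///out/bgp_result';

OPTIONAL, UNION and FILTER can be covered in the same spirit using Pig's outer join, union and filter operators.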
15. PigSPARQL: A SPARQL Query Processing Baseline for Big Data 15
Benefits of PigSPARQL
MapReduce PigSPARQL
„Low-level“ to implement User writes query in SPARQL
auto translated into Pig Latin
16. PigSPARQL: A SPARQL Query Processing Baseline for Big Data 16
Benefits of PigSPARQL
MapReduce PigSPARQL
„Low-level“ to implement
No Primitives
everything must be „hand-coded“
User writes query in SPARQL
auto translated into Pig Latin
Based on Pig Latin
no need to reinvent the wheel
17. PigSPARQL: A SPARQL Query Processing Baseline for Big Data 17
Benefits of PigSPARQL
MapReduce PigSPARQL
„Low-level“ to implement
No Primitives
everything must be „hand-coded“
Optimizations
completely up to you
User writes query in SPARQL
auto translated into Pig Latin
Based on Pig Latin
no need to reinvent the wheel
Pig is continuously
optimized and refined
(order of magnitude speedup from
Pig 0.5.0 to 0.11.0)
18. PigSPARQL: A SPARQL Query Processing Baseline for Big Data 18
Benefits of PigSPARQL
MapReduce PigSPARQL
„Low-level“ to implement
No Primitives
everything must be „hand-coded“
Optimizations
completely up to you
Version changes of Hadoop?
User writes query in SPARQL
auto translated into Pig Latin
Based on Pig Latin
no need to reinvent the wheel
Pig is continuously
optimized and refined
(order of magnitude speedup from
Pig 0.5.0 to 0.11.0)
Pig copes with version issues
19. PigSPARQL: A SPARQL Query Processing Baseline for Big Data 19
Benefits of PigSPARQL
MapReduce PigSPARQL
„Low-level“ to implement
No Primitives
everything must be „hand-coded“
Optimizations
completely up to you
Version changes of Hadoop?
User writes query in SPARQL
auto translated into Pig Latin
Based on Pig Latin
no need to reinvent the wheel
Pig is continuously
optimized and refined
(order of magnitude speedup from
Pig 0.5.0 to 0.11.0)
Pig copes with version issues
No additional software installation
or changes to Hadoop required
20. PigSPARQL: A SPARQL Query Processing Baseline for Big Data 20
Last but not least
For an interesting discussion, please visit us at the
Posters & Demos Session
ISWC 2013, Sydney, Australia
23rd October 2013
[http://dbis.informatik.uni-freiburg.de/PigSPARQL]