Driven by initiatives like Schema.org, the amount of semantically annotated data is expected to grow steadily towards massive scale, requiring cluster-based solutions to query it. At the same time, Hadoop has become dominant in the area of Big Data processing, with large infrastructures already deployed and in use across many application fields. For Hadoop-based applications, a common data pool (HDFS) provides many synergy benefits, making it very attractive to use these infrastructures for semantic data processing as well.
Indeed, existing SPARQL-on-Hadoop (MapReduce) approaches have already demonstrated very good scalability; however, query runtimes are rather slow due to the underlying batch processing framework. While this is acceptable for data-intensive queries, it is not satisfactory for the majority of SPARQL queries, which are typically much more selective and require only small subsets of the data.
In this paper, we present Sempala, a SPARQL-over-SQL-on-Hadoop approach designed with selective queries in mind. Our evaluation shows performance improvements by an order of magnitude compared to existing approaches, paving the way for interactive-time SPARQL query processing on Hadoop.
Sempala - Interactive SPARQL Query Processing on Hadoop
1. Sempala
Interactive SPARQL Query Processing
on Hadoop
Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, Georg Lausen
University of Freiburg, Germany
ISWC 2014 - Riva del Garda, Italy, 22.10.2014
2. Motivation
Semantic Web has arrived in real-world applications
(not only academia & research)
• Web-scale semantic data makes single-machine solutions infeasible
Hadoop ecosystem has become the de-facto standard for Big Data applications
Our Idea: Use it for Semantic Web purposes as well
• 2 main reasons:
1) Web-scale semantic data requires solutions that scale out
2) Industry has settled on Hadoop (or Hadoop-style) architectures, which offer a superior cost-benefit ratio compared to specialized infrastructures
3. Motivation
SPARQL-on-Hadoop
• Existing solutions are mostly based on MapReduce
• Scale very well to billions of triples (or more)
• But MapReduce is batch-oriented and thus inherently slow
• Good for unselective ETL-like queries with many results
SPARQL queries are often explorative and ad-hoc
• Typically selective, thus returning only a few results
Runtimes in the order of hours are not satisfying!
• But interactive runtimes are virtually impossible to achieve with MapReduce
Need for interactive SPARQL-on-Hadoop
• Following the trend of interactive SQL-on-Hadoop
4. Sempala
SPARQL-on-Hadoop query engine
• Especially designed with ad-hoc, selective queries in mind
• Interactive query runtimes for such queries
• Built on top of Impala, an MPP SQL query engine for Hadoop
• RDF data stored in HDFS using columnar storage format (Parquet)
5. Sempala – Data Layout
Single unified property table
• Parquet is a column-oriented storage format optimized for wide tables from which only a few columns are read per query
• Sparse columns cause little to no storage overhead in Parquet
• No clustering and no joins needed for star-pattern queries
• Duplication strategy for multi-valued properties
6. Sempala – Data Layout
Single unified property table
• Parquet is a column-oriented storage format optimized for wide tables from which only a few columns are read per query
• Sparse columns cause little to no storage overhead in Parquet
• No clustering and no joins needed for star-pattern queries
• Duplication strategy for multi-valued properties
(figure annotation: repeated values are stored run-length encoded, e.g. (Article2, 2))
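As a rough illustration of this layout (a minimal sketch, not Sempala's exact schema; the table and column names are hypothetical), a unified property table backed by Parquet could be declared in Impala as follows, with one row per subject, one sparse column per RDF property, and row duplication for multi-valued properties:

-- Hypothetical sketch of a unified property table stored as Parquet.
-- One row per subject, one (possibly sparse) column per RDF property.
CREATE TABLE property_table (
  subject  STRING,
  rdf_type STRING,
  title    STRING,
  author   STRING   -- multi-valued: covered by the duplication strategy
) STORED AS PARQUET;

-- A subject with two authors is stored as two rows that differ only in "author":
--   Article2 | Article | "Some Title" | Alice
--   Article2 | Article | "Some Title" | Bob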
7. Sempala – Query Compiler
BGP processing in Sempala
• Decompose BGP into disjoint triple groups having the same subject
• Use a subquery for every triple group and join the results (join group)
8. Sempala – Query Compiler
BGP processing in Sempala
• Decompose BGP into disjoint triple groups having the same subject
• Use a subquery for every triple group and join the results (join group)
triple group tg1
triple group tg2
9. Sempala – Query Compiler
BGP processing in Sempala
• Decompose BGP into disjoint triple groups having the same subject
• Use a subquery for every triple group and join the results (join group)
triple group tg1
triple group tg2
join group jg1
10. Sempala – Query Compiler
BGP processing in Sempala
• Decompose BGP into disjoint triple groups having the same subject
• Use a subquery for every triple group and join the results (join group)
triple group tg1
triple group tg2
join group jg1
tg1
tg2
11. Sempala – Query Compiler
BGP processing in Sempala
• Decompose BGP into disjoint triple groups having the same subject
• Use a subquery for every triple group and join the results (join group)
triple group tg1
triple group tg2
jg1
tg1
tg2
join group jg1
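To make the decomposition concrete, the following is a minimal sketch (not Sempala's actual generated SQL) of how a BGP with two star patterns could compile to Impala SQL over the hypothetical property_table sketched above: one subquery per triple group, joined on the shared variable.

-- Sketch only: a BGP such as
--   ?article rdf:type :Article . ?article :author ?author .
--   ?author  rdf:type :Person  . ?author  :name   ?name .
-- decomposes into two triple groups (one per subject variable) and one join group:
SELECT tg1.article, tg2.name
FROM (SELECT subject AS article, author        -- triple group tg1 (subject ?article)
      FROM property_table
      WHERE rdf_type = 'Article' AND author IS NOT NULL) tg1
JOIN (SELECT subject AS author, name           -- triple group tg2 (subject ?author)
      FROM property_table
      WHERE rdf_type = 'Person' AND name IS NOT NULL) tg2
ON tg1.author = tg2.author;                    -- join group jg1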
12. Sempala – Query Compiler
Overall workflow from SPARQL to Impala SQL
13. Sempala – Query Compiler
Overall workflow from SPARQL to Impala SQL
14. Sempala – Query Compiler
Overall workflow from SPARQL to Impala SQL
15. Evaluation
Small Cluster with low-end configuration
• 10 machines, 32 GB RAM and 2 disks each, Gigabit network
(Cloudera recommends 256 GB RAM and 12 disks or more)
• CDH 4.5, Impala 1.2.3
LUBM and BSBM benchmarks
• LUBM up to 3K universities
• BSBM up to 3M products
Compared Sempala with 4 other Hadoop-based systems
• Hive (same Query Compiler as Sempala but Hive as execution engine)
• PigSPARQL (built on top of Apache Pig) [1]
• MapMerge (map-side merge join for SPARQL BGPs) [2]
• MAPSIN (map-side index nested loop join based on Apache HBase) [3]
16. Evaluation
LUBM 3K (sec in log scale)
17. Evaluation
LUBM 3K (sec in log scale)
Geometric Mean for selective (star-shaped) queries:
• Sempala (8.3) , Hive (89) , PigSPARQL (69.7) , MAPSIN (78.1) , MapMerge (65.8)
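For reference, the geometric mean reported for a group of n queries with runtimes t_1, ..., t_n is computed in the usual way:
\bar{t}_{\mathrm{geo}} = \left( \prod_{i=1}^{n} t_i \right)^{1/n}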
18. Evaluation
LUBM 3K (sec in log scale)
Geometric Mean for more sophisticated queries:
• Sempala (20.7) , Hive (316.6) , PigSPARQL (266) , MAPSIN (119.5) , MapMerge (175.6)
19. Evaluation
LUBM 3K (sec in log scale)
Geometric Mean for unselective queries:
• Sempala (63.5) , Hive (166) , PigSPARQL (52.4) , MAPSIN (101.4) , MapMerge (24.5)
• Runtime of Sempala is dominated by storing millions of records in the result table
20. Summary
Sempala SPARQL query engine for Hadoop
• Built on top of a state-of-the-art MPP SQL query engine (Impala)
• Uses a state-of-the-art columnar storage format (Parquet)
• SPARQL 1.0 (not only BGPs)
Evaluation
• Outperforms existing Hadoop-based systems by an order of magnitude
• Interactive runtimes for selective queries (selective ≠ simple)
(order of seconds, not minutes or even hours)
Future Work
• Refinement of data layout (nested data structures with Impala 2.1)
• SPARQL 1.1 features (e.g. subqueries, aggregations)
21. References
[1] Schätzle, A. et al.: PigSPARQL: Mapping SPARQL to Pig Latin. In: SWIM 2011, pp. 4:1-4:8
[2] Przyjaciel-Zablocki, M. et al.: Map-Side Merge Joins for Scalable SPARQL BGP Processing. In: CloudCom 2013, pp. 631-638
[3] Schätzle, A. et al.: Cascading Map-Side Joins over HBase for Scalable Join Processing. In: SSWS+HPCSW 2012, pp. 59-74