Efficient processing of Rank-aware queries in Map/Reduce

Through an experimental evaluation of three algorithms, this presentation shows the drawbacks of the default Map/Reduce execution model for Top-K queries and proposes a solution for processing such queries efficiently. It addresses two major shortcomings: the lack of Early Termination and the lack of Load Balancing. The solution has been implemented in code.

Need for a new model
• Exponential data growth
• Need to analyze and exploit ever-larger volumes of data at scale
• Need for parallel processing
• Need to reduce data read and retrieval time
• Need for programmer convenience
• Cost

What is Map/Reduce?
A programming model and runtime environment for distributed data processing that runs in parallel on clusters made up of a large number of machines

Weaknesses in Top-K Join Queries
What is the Top-K Join?
Weaknesses
• All the data is read in order to retrieve only K results
• The workload is distributed unevenly across Reducers
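The slides pose the Top-K Join question without a textual answer. As a minimal sketch, assuming the usual rank-join definition (join two tables on a common attribute and return the K joined pairs with the highest combined score) and assuming the combined score is the sum of the two scores, the default behaviour can be illustrated in plain Python. The name `naive_topk_join` and the `(join_key, score)` tuple layout are hypothetical and are not taken from the presentation's Hadoop code.

```python
import heapq
from collections import defaultdict


def naive_topk_join(left, right, k):
    """Join every matching pair, then keep the k best by combined score.

    left, right: iterables of (join_key, score) tuples.
    Assumption: the combined score is the sum of the two scores.
    """
    # Index the right table by join key so each left row finds its partners.
    by_key = defaultdict(list)
    for key, score in right:
        by_key[key].append(score)

    # Every row of both tables is consumed, even though only k results survive.
    joined = (
        (l_score + r_score, key)
        for key, l_score in left
        for r_score in by_key.get(key, ())
    )
    return heapq.nlargest(k, joined)


left = [("a", 9), ("b", 5), ("a", 1)]
right = [("a", 8), ("b", 7), ("c", 10)]
result = naive_topk_join(left, right, 2)
```

Both inputs are read in full before the K results are selected, which corresponds to the first weakness listed above.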

Goals of the experiment
• Implement Top-K Join queries efficiently in the Map/Reduce model
• Address the Map/Reduce weaknesses shown above with:
  ◦ Early Termination
  ◦ Load Balancing

Design
• Comparison of three algorithms (one default and two new)
  ◦ Naive
  ◦ EarlyTermination (using bounds)
  ◦ EarlyTermination & LoadBalancing (using bounds and Longest Processing Time)
• Preprocessing
  ◦ Generation of two data tables with join attributes
  ◦ Statistics on the data in the form of histograms
• Processing
  ◦ Calculation of histogram bounds for each table
  ◦ Map/Reduce run
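The statistics step can be sketched as follows. The slides do not say which attribute the histograms cover or how the bins are chosen, so this example assumes equi-width bins over the score range from 0 to a known maximum. The function name `build_histogram` and its parameters are hypothetical.

```python
def build_histogram(scores, num_bins, max_score):
    """Equi-width histogram over [0, max_score].

    Returns a list of (low, high, count) tuples, one per bin.
    Assumption: scores are non-negative and never exceed max_score.
    """
    width = max_score / num_bins
    counts = [0] * num_bins
    for s in scores:
        # The maximum score falls into the last bin rather than past it.
        idx = min(int(s / width), num_bins - 1)
        counts[idx] += 1
    return [(i * width, (i + 1) * width, counts[i]) for i in range(num_bins)]


hist = build_histogram([0, 5, 10, 99, 100], num_bins=10, max_score=100)
```

Each bin's upper edge gives an upper limit on the score of any record it counts, which is the kind of information a bound calculation could start from.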

Early Termination
[Diagram with the labels: HDFS, Generated Sorted Data, Histograms, EarlyTermInputFormat, EarlyTermRecordReader, Check Bounds, Send Data, Mapper, Reducers, Process]
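The slides do not spell out how the bounds are checked. As an illustration of early termination over score-sorted inputs, the sketch below uses the generic rank-join threshold test (in the style of HRJN) rather than the presentation's histogram-based bounds: reading stops once the K-th best result found so far is at least as good as anything an unread record could still produce. It assumes both inputs are sorted by score in descending order and that the combined score is the sum of the two scores. All names are hypothetical.

```python
import heapq
from collections import defaultdict

NEG_INF = float("-inf")


def _unseen_bound(rows, read, other_top):
    """Upper bound on the score of any join result that uses an unread row."""
    if read >= len(rows):
        return NEG_INF  # input exhausted: no unread rows remain
    last = rows[read - 1][1] if read > 0 else rows[0][1]
    return last + other_top


def topk_join_early_termination(left, right, k):
    """left, right: lists of (join_key, score) sorted by score, descending.

    Returns (top-k results as (combined_score, key), number of rows read).
    """
    if k <= 0:
        return [], 0
    seen_left, seen_right = defaultdict(list), defaultdict(list)
    top = []  # min-heap holding at most k results
    left_top = left[0][1] if left else NEG_INF
    right_top = right[0][1] if right else NEG_INF
    i = j = 0
    while i < len(left) or j < len(right):
        # Alternate between the inputs; fall back to whichever still has rows.
        read_left = j >= len(right) or (i < len(left) and i <= j)
        if read_left:
            key, score = left[i]
            i += 1
            seen_left[key].append(score)
            partners = seen_right[key]
        else:
            key, score = right[j]
            j += 1
            seen_right[key].append(score)
            partners = seen_left[key]
        for other in partners:
            heapq.heappush(top, (score + other, key))
            if len(top) > k:
                heapq.heappop(top)
        threshold = max(_unseen_bound(left, i, right_top),
                        _unseen_bound(right, j, left_top))
        # Stop when no unread row can beat the current k-th best result.
        if threshold == NEG_INF or (len(top) == k and top[0][0] >= threshold):
            break
    return sorted(top, reverse=True), i + j


left = [("a", 10), ("b", 9), ("c", 2), ("d", 1)]
right = [("a", 10), ("b", 8), ("e", 3), ("f", 1)]
top1, read1 = topk_join_early_termination(left, right, 1)
top2, read2 = topk_join_early_termination(left, right, 2)
```

On this small input the top result is found after reading 2 of the 8 rows, and the top two after reading 6.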

Early Termination & Load Balancing
[Diagram with the labels: HDFS, Generated Sorted Data, Histograms, EarlyTermInputFormat, EarlyTermRecordReader, Check Bounds, Send Data, Mapper, CustomPartitioner, Reducer (three instances)]
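The Design slide names Longest Processing Time (LPT) as the load-balancing heuristic. LPT sorts jobs by decreasing cost and gives each one to the currently least-loaded worker. The sketch below applies it to join keys and reducers, assuming an estimated load per join key is already available (for example from the statistics step). How the presentation estimates that load is not stated, and the names `lpt_assign` and `key_loads` are hypothetical.

```python
import heapq


def lpt_assign(key_loads, num_reducers):
    """Assign join keys to reducers with the Longest Processing Time rule.

    key_loads: dict mapping join_key -> estimated work for that key.
    Returns (assignment dict key -> reducer index, list of reducer loads).
    """
    # Min-heap of (current load, reducer index): the top is the least loaded.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Heaviest keys first; ties are broken by key text for a stable order.
    ordered = sorted(key_loads.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, load in ordered:
        total, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))
    loads = [0] * num_reducers
    for total, reducer in heap:
        loads[reducer] = total
    return assignment, loads


assignment, loads = lpt_assign({"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}, 2)
```

In a Hadoop job, a custom partitioner could look up such a precomputed assignment instead of hashing the key. That is one plausible role for the CustomPartitioner named in the diagram, though the slides do not confirm it.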

Experiment (1)
Parameter: Value
Data distribution: Zipfian
Number of records: 1,000,000 per table
Number of reducers: 10, 6
Number of results (K): 10
Data skew: 0, 0.5, 1
Number of join attributes: 10
Maximum data value: 10,000
Sorting: by score
Histograms: 10 bins
Cluster: 8 machines
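A test table with these characteristics could be generated roughly as follows. This is a sketch under several assumptions that the slides do not confirm: the skew parameter is taken to be the exponent of a Zipf distribution over the join-key values (so skew 0 is uniform), the maximum data value is taken to be the maximum score, and scores are drawn uniformly. The names `zipf_weights` and `generate_table` are hypothetical.

```python
import random


def zipf_weights(n, skew):
    """Normalized Zipf probabilities for ranks 1..n: p(rank) ~ 1 / rank**skew."""
    raw = [1.0 / (rank ** skew) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def generate_table(num_rows, num_keys, skew, max_score, seed=0):
    """Return (join_key, score) rows sorted by score in descending order."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    keys = rng.choices(range(num_keys),
                       weights=zipf_weights(num_keys, skew),
                       k=num_rows)
    rows = [(key, rng.randint(0, max_score)) for key in keys]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


uniform = zipf_weights(10, 0)
skewed = zipf_weights(10, 1)
rows = generate_table(num_rows=1000, num_keys=10, skew=1, max_score=10000)
```

With skew 0 every join key is equally likely; with skew 1 the first key is ten times as likely as the tenth, which is the kind of imbalance that concentrates work on a few reducers.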

Experiment Part – Comparison of algorithms (2)
[Chart, REDUCERS = 10: running time (y-axis, 0:00:00 to 0:50:24) versus skew (x-axis: 0, 0.5, 1) for Naive, Early Termination, and Early Termination & Load Balancing]

Experiment Part – Comparison of algorithms (3)
[Chart, REDUCERS = 10: number of records (y-axis, 0 to 2,500,000) versus skew (x-axis: 0, 0.5, 1) for Naive, Early Termination, and Early Termination & Load Balancing]

Experiment Part – Comparison of algorithms (4)
[Chart: running time (y-axis, 0:00:00 to 0:17:17) versus number of reducers (x-axis: 6, 10) for Early Termination and Early Termination & Load Balancing. Panel label: REDUCERS = 6]

Conclusion
By using the proposed techniques:
• Early Termination
• Load Balancing
it is possible to implement rank-aware (Top-K) queries efficiently in Map/Reduce and to overcome the shortcomings of the Map/Reduce model.


Hui 3.0

Arulkumar Arumugam

SOME WORKLOAD SCHEDULING ALTERNATIVES 11.07.2013

James McGalliard

This document discusses various workload scheduling alternatives for high performance computing environments. It begins by describing typical HPC workloads and challenges in scheduling large parallel jobs. It then covers scheduling techniques like backfill and frameworks like MapReduce and Hadoop. Alternative prioritization methods are proposed, like prioritizing based on estimated run time, wait time, or number of processors requested. The document concludes by showing results comparing different dynamic prioritization approaches.

Scalable analytics for iaas cloud availability

Papitha Velumani

Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010

ivan provalov

Two presentation from the Michigan Information Retrieval Enthusiasts Group Meetup on August 19 by Cengage Learning search platform development team. Scaling Performance Tuning With Lucene by John Nader discusses primary performance hot spots related to scaling to a multi-million document collection. This includes the team's experiences with memory consumption, GC tuning, query expansion, and filter performance. Discusses both the tools used to identify issues and the techniques used to address them. Relevance Tuning Using TREC Dataset by Rohit Laungani and Ivan Provalov describes the TREC dataset used by the team to improve the relevance of the Lucene-based search platform. Goes over IBM paper and describe the approaches tried: Lexical Affinities, Stemming, Pivot Length Normalization, Sweet Spot Similarity, Term Frequency Average Normalization. Talks about Pseudo Relevance Feedback.

Efficient processing of Rank-aware queries in Map/Reduce

1. Efficient Processing of Rank-Aware Queries in Map/Reduce
Oikonomakis Spyridon, Software Engineer at PeoplePerHour

2. Need for a new model
- Exponential data growth
- Need to analyze, exploit and scale to ever larger volumes of data
- Need for parallel processing
- Need to reduce data read and retrieval time
- Need for programmer convenience
- Cost

3. What is Map/Reduce?
A distributed data processing programming model and runtime environment that runs on large clusters of machines and processes data in parallel.
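To make the definition concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases. It is only an illustration in Python, not Hadoop code: the names `run_map_reduce`, `count_mapper` and `count_reducer` are hypothetical, and a real runtime would distribute each phase across machines.

```python
from collections import defaultdict

def run_map_reduce(records, mapper, reducer):
    """Single-process imitation of the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for record in records:                  # map phase: emit (key, value) pairs
        for key, value in mapper(record):
            groups[key].append(value)       # shuffle: group values by key
    # reduce phase: one reducer call per key
    return {key: reducer(key, values) for key, values in groups.items()}

def count_mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word, 1

def count_reducer(key, values):
    # Sum the partial counts collected for one word.
    return sum(values)

result = run_map_reduce(["a b a", "b c"], count_mapper, count_reducer)
```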

4. Is the Map/Reduce model reliable?

5. Map/Reduce

6. Weaknesses in Top-K Join Queries
What is the Top-K Join?
Weaknesses
- All the data must be read to retrieve only K results
- Workload is distributed unevenly across Reducers
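The first weakness can be illustrated with a small Python sketch of a naive Top-K join. The slides do not define the scoring function, so this assumes each tuple is a `(join_value, score)` pair and that a join result is scored by the sum of the two input scores; `naive_top_k_join` is a hypothetical name.

```python
import heapq
from collections import defaultdict

def naive_top_k_join(table_r, table_s, k):
    """Join every pair of tuples that share a join value, then keep the k best.

    Assumption: tuples are (join_value, score) and the combined score is the sum.
    """
    s_by_key = defaultdict(list)
    for join_value, score in table_s:            # every tuple of S is read
        s_by_key[join_value].append(score)

    joined = []
    for join_value, r_score in table_r:          # every tuple of R is read
        for s_score in s_by_key.get(join_value, ()):
            joined.append((r_score + s_score, join_value))
    return heapq.nlargest(k, joined)             # only k results are kept

table_r = [(1, 90), (2, 80), (1, 10)]
table_s = [(1, 70), (2, 95), (3, 99)]
top_two = naive_top_k_join(table_r, table_s, 2)
```

Both tables are scanned in full and every matching pair is materialized, even though only `k` results survive.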

7. Goals of the experiment
- Implement Top-K Join queries efficiently in the Map/Reduce model
- Address the weaknesses observed in Map/Reduce with:
  - Early Termination
  - Load Balancing

8. Design
- Comparison of three algorithms (1 default and 2 new)
  - Naive
  - EarlyTermination (using bounds)
  - EarlyTermination & LoadBalancing (using bounds and Longest Processing Time)
- Pre-processing
  - Generation of two data tables with join attributes
  - Statistics on the data in the form of histograms
- Processing
  - Calculation of bounds from the histograms of each table
  - Run Map/Reduce
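The slides do not spell out how the bounds are derived from the histograms, so the following Python sketch shows one conservative possibility under stated assumptions: tuples are `(join_value, score)` pairs, a join result is scored by the sum of its two scores, histograms use equal-width score bins per join value, and both tables are non-empty. The function names are hypothetical.

```python
from collections import defaultdict

def build_histogram(table, num_bins, max_score):
    """Count tuples per (join value, score bin) using equal-width bins."""
    hist = defaultdict(lambda: [0] * num_bins)
    for join_value, score in table:
        bin_index = min(int(score * num_bins / max_score), num_bins - 1)
        hist[join_value][bin_index] += 1
    return hist

def top_bin_upper_edge(hist, bin_width):
    """Upper edge of the highest non-empty score bin (histogram must be non-empty)."""
    top = max(i for counts in hist.values() for i, c in enumerate(counts) if c)
    return (top + 1) * bin_width

def compute_bounds(hist_r, hist_s, k, num_bins, max_score):
    """Return (bound_r, bound_s); tuples scoring below them cannot reach the top k."""
    bin_width = max_score / num_bins
    combos = []
    for join_value in hist_r.keys() & hist_s.keys():
        for i, count_r in enumerate(hist_r[join_value]):
            for j, count_s in enumerate(hist_s[join_value]):
                if count_r and count_s:
                    # count_r * count_s results, each scoring at least (i + j) * bin_width
                    combos.append(((i + j) * bin_width, count_r * count_s))
    combos.sort(reverse=True)
    threshold, guaranteed = 0.0, 0
    for lower_bound, count in combos:
        guaranteed += count
        if guaranteed >= k:              # at least k results score >= lower_bound
            threshold = lower_bound
            break
    # A tuple helps only if it can reach the threshold with the best partner score.
    bound_r = max(0.0, threshold - top_bin_upper_edge(hist_s, bin_width))
    bound_s = max(0.0, threshold - top_bin_upper_edge(hist_r, bin_width))
    return bound_r, bound_s

hist_r = build_histogram([(1, 95), (1, 15), (2, 55)], num_bins=10, max_score=100)
hist_s = build_histogram([(1, 85), (2, 45), (2, 5)], num_bins=10, max_score=100)
bounds = compute_bounds(hist_r, hist_s, k=1, num_bins=10, max_score=100)
```

In this toy input the best guaranteed result comes from join value 1 with a score of at least 170, so only tuples scoring at least 80 in the first table and 70 in the second need to be read. If fewer than `k` results can be guaranteed, both bounds fall back to 0 and everything is read.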

9. Design(2)

10. Early Termination
[Diagram. Components: HDFS (generated sorted data, histograms), EarlyTermInputFormat, EarlyTermRecordReader (check bounds), Mapper, Reducers. Arrow labels: Send Data, Process.]
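The idea of stopping the read once the bound is crossed can be sketched in a few lines of Python. This is not the Hadoop `EarlyTermRecordReader` from the slide, whose implementation is not shown; `early_term_reader` is a hypothetical stand-in that assumes records are `(join_value, score)` pairs already sorted by score, highest first.

```python
def early_term_reader(sorted_records, bound):
    """Yield (join_value, score) records until the score drops below the bound.

    Assumption: sorted_records is ordered by score, highest first.
    """
    for join_value, score in sorted_records:
        if score < bound:
            break  # every later record scores even lower, so stop reading
        yield join_value, score

records = [(1, 95), (2, 80), (1, 40), (3, 10)]
passed_on = list(early_term_reader(records, bound=50))
```

Because the input is sorted, the reader never touches the records after the first one that falls below the bound.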

11. Early Termination & Load Balancing
[Diagram. Components: HDFS (generated sorted data, histograms), EarlyTermInputFormat, EarlyTermRecordReader (check bounds), Mapper, CustomPartitioner, three Reducers. Arrow labels: Send Data.]
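The design slide names Longest Processing Time (LPT) as the load balancing technique. The Python sketch below shows the standard LPT heuristic: take the most expensive join values first and always hand the next one to the least-loaded reducer. How the deck estimates the cost of each join value is not stated, so the `costs` dictionary and the name `lpt_assign` are assumptions for illustration.

```python
import heapq

def lpt_assign(costs, num_reducers):
    """Map each join value to a reducer using Longest Processing Time first."""
    heap = [(0, reducer) for reducer in range(num_reducers)]  # (load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    # Process join values from the most to the least expensive.
    for join_value, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        load, reducer = heapq.heappop(heap)       # least-loaded reducer so far
        assignment[join_value] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    return assignment

costs = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}   # hypothetical estimated work per join value
assignment = lpt_assign(costs, num_reducers=2)
```

A custom partitioner could then look up `assignment[join_value]` instead of hashing the key, which is what lets skewed join values be spread more evenly.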

12. Experiment (1)
Parameter: Value
- Data distribution: Zipfian
- Number of records: 1,000,000 per table
- Number of reducers: 10, 6
- Number of results (K): 10
- Data skew: 0, 0.5, 1
- Number of joining attributes: 10
- Max value for data: 10000
- Sorting: by score
- Histograms: 10 bins
- Cluster: 8 machines
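A table with these parameters could be generated roughly as in the Python sketch below. The slides do not say which attribute follows the Zipfian distribution or how scores are drawn, so this assumes the join values are Zipf-distributed with the given skew (skew 0 gives a uniform distribution), scores are uniform integers up to the maximum value, and the table is sorted by score, highest first. `generate_table` and its parameters are hypothetical.

```python
import random

def generate_table(num_records, num_join_values, skew, max_value, seed=0):
    """Generate (join_value, score) tuples sorted by score, highest first."""
    rng = random.Random(seed)
    # Zipfian weights: the value at rank r is drawn with probability proportional to 1 / r**skew.
    weights = [1.0 / (rank ** skew) for rank in range(1, num_join_values + 1)]
    join_values = rng.choices(range(num_join_values), weights=weights, k=num_records)
    table = [(jv, rng.randint(0, max_value)) for jv in join_values]
    table.sort(key=lambda t: t[1], reverse=True)
    return table

# A small sample; the experiment itself uses 1,000,000 records per table.
table = generate_table(num_records=1000, num_join_values=10, skew=1.0, max_value=10000)
```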

13. Experiment Part – Comparison of algorithms (2)
[Chart: running time (y-axis, 0:00:00 to 0:50:24) against skew (x-axis: 0, 0.5, 1); Reducers = 10; series: Naive, Early Termination, Early Termination & Load Balancing]

14. Experiment Part – Comparison of algorithms (3)
[Chart: number of records (y-axis, 0 to 2,500,000) against skew (x-axis: 0, 0.5, 1); Reducers = 10; series: Naive, Early Termination, Early Termination & Load Balancing]

15. Experiment Part – Comparison of algorithms (4)
[Chart: running time (y-axis, 0:00:00 to 0:17:17) against number of Reducers (x-axis: 6, 10); chart title: REDUCERS = 6; series: Early Termination, Early Termination & Load Balancing]

16. Conclusion
With the proposed techniques:
- Early Termination
- Load Balancing
it is possible to implement rank-aware (Top-K) queries efficiently in Map/Reduce and to overcome the weaknesses of the Map/Reduce model.

17. Questions? Thank you.
