Map Reduce

This document provides an overview of MapReduce, including: what MapReduce is and how it works through mapping and reducing large datasets in parallel across clusters; how data flows through Map and Reduce functions; comparisons of different MapReduce models like Google and Hadoop; and a demonstration using Java streams and lambda expressions. References are also provided for further reading.

© RAGINIJAIN CC SA 4.0
Ragini Jain
MSc CA 1st Year (2015 - 2017)
Map Reduce

Overview
● What is Map Reduce
● Map Reduce schematic
● Map Reduce in detail
● Comparison of Map Reduce models
● Demo
● References

What is Map Reduce
● A software framework that supports parallel, distributed
computing on large data sets.
● The framework abstracts the data flow of running a parallel
program on a distributed computing system by providing users
with two interfaces in the form of functions:
– Map
– Reduce
● Users control and manipulate the data flow of their programs
by overriding the Map() and Reduce() functions.
● The Map Reduce library is the controller.
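
The two user-facing functions and the controlling library described above can be sketched in plain Java 8. This is a minimal single-JVM illustration, not how any real framework is implemented: the names MiniMapReduce, MapFn, ReduceFn, run and wordCount are hypothetical and were chosen only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Illustrative only: MiniMapReduce, MapFn and ReduceFn are hypothetical names,
// and everything runs in one JVM rather than on a cluster.
class MiniMapReduce {

    /** User-supplied Map step: one input record in, any number of (key, value) pairs out. */
    interface MapFn<I, K, V> {
        void map(I input, BiConsumer<K, V> emit);
    }

    /** User-supplied Reduce step: one key plus all of its values in, one result out. */
    interface ReduceFn<K, V, R> {
        R reduce(K key, List<V> values);
    }

    /** The "controller": runs map on every input, groups by key, then runs reduce per key. */
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs, MapFn<I, K, V> mapFn, ReduceFn<K, V, R> reduceFn) {
        Map<K, List<V>> groups = new TreeMap<>(); // TreeMap keeps the keys sorted
        for (I input : inputs) {
            mapFn.map(input, (k, v) -> groups.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        Map<K, R> results = new TreeMap<>();
        groups.forEach((k, vs) -> results.put(k, reduceFn.reduce(k, vs)));
        return results;
    }

    /** Word count expressed only through the two user-defined functions. */
    static Map<String, Integer> wordCount(List<String> lines) {
        MapFn<String, String, Integer> mapFn = (line, emit) -> {
            for (String word : line.split(" ")) {
                emit.accept(word, 1); // emit (word, 1) for every word seen
            }
        };
        ReduceFn<String, Integer, Integer> reduceFn = (word, ones) -> ones.size();
        return run(lines, mapFn, reduceFn);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

The user code only fills in mapFn and reduceFn; grouping, ordering and invocation stay inside run, which plays the role of the controller.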

Map Reduce schematic
Source: jeremykyun

Map Reduce schematic (2)
Source: hadoop project

Map Reduce (in detail)
● The Map function is applied in parallel to every input (key, value)
pair and produces a new set of intermediate (key, value) pairs
(key1, val1) ------(map function)---> List (key2, val2)
● The MapReduce library then collects the intermediate
(key, value) pairs produced from all input pairs and sorts them
by key, so that all values for the same key end up in one group
● Finally, the Reduce function is applied in parallel to each group,
producing a collection of values
(key2, List(val2)) -----(reduce function) ---> List (val2)
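
The three phases above can be traced with Java 8 streams. The sketch below is an assumption-laden illustration, not the deck's demo: the class MaxTemperature, the "year temperature" record layout and the method names are hypothetical, and parallel streams on one machine stand in for cluster-wide parallelism.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical example: highest temperature per year from "year temp" records.
class MaxTemperature {

    // map: (key1, val1) -> List(key2, val2); here (line number, "year temp") -> [(year, temp)]
    static List<Map.Entry<String, Integer>> map(Integer lineNo, String record) {
        String[] parts = record.trim().split("\\s+");
        return Collections.singletonList(new SimpleEntry<>(parts[0], Integer.parseInt(parts[1])));
    }

    // reduce: (key2, List(val2)) -> List(val2); here (year, temps) -> [max temp]
    static List<Integer> reduce(String year, List<Integer> temps) {
        return Collections.singletonList(Collections.max(temps));
    }

    static Map<String, List<Integer>> run(List<String> records) {
        // Map phase: apply map() to every input pair in parallel.
        List<Map.Entry<String, Integer>> intermediate = IntStream.range(0, records.size())
                .parallel()
                .boxed()
                .flatMap(i -> map(i, records.get(i)).stream())
                .collect(Collectors.toList());

        // Collect and sort: group the intermediate pairs by key, keys in sorted order.
        Map<String, List<Integer>> grouped = intermediate.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey, TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase: apply reduce() to each group in parallel.
        return grouped.entrySet().parallelStream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> reduce(e.getKey(), e.getValue()),
                        (a, b) -> a,
                        TreeMap::new));
    }
}
```

Calling MaxTemperature.run on the records "1950 22", "1949 30", "1950 31" and "1949 -5" groups the readings under 1949 and 1950 and keeps the largest value of each group.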

Map Reduce (as a query framework)
● SQL clauses that are the building blocks for Map Reduce
operations on structured data and data warehouses
– GROUP BY
– ORDER BY
● On a very large set of demographic data
         SELECT age, AVG(contacts)
             FROM social.person
         GROUP BY age
         ORDER BY age
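
For comparison, the same query can be approximated with Java 8 collectors: grouping by age plays the part of GROUP BY, a sorted map plays the part of ORDER BY, and the averaging collector plays the part of AVG. The Person class below is a hypothetical stand-in for a row of social.person, and the class and method names are assumptions made for this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class AvgContactsByAge {

    // Hypothetical stand-in for one row of the social.person table.
    static final class Person {
        private final int age;
        private final int contacts;

        Person(int age, int contacts) {
            this.age = age;
            this.contacts = contacts;
        }

        int getAge() { return age; }
        int getContacts() { return contacts; }
    }

    // SELECT age, AVG(contacts) FROM social.person GROUP BY age ORDER BY age
    static Map<Integer, Double> averageContactsByAge(List<Person> people) {
        return people.parallelStream()
                .collect(Collectors.groupingBy(
                        Person::getAge,                                  // GROUP BY age
                        TreeMap::new,                                    // ORDER BY age
                        Collectors.averagingInt(Person::getContacts))); // AVG(contacts)
    }
}
```

Here the grouping key corresponds to the intermediate key of the Map step and the averaging collector to the Reduce step applied to each group.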

GROUP BY (SQL vs Pig)

Comparison of Map Reduce models
● Google Map Reduce
– Programming model: Map Reduce
– Data handling: Google File System (GFS)
● Apache Hadoop
– Programming model: Map Reduce
– Data handling: HDFS (Hadoop Distributed File System)
● Microsoft Dryad
– Programming model: DAG (Directed Acyclic Graph) execution
– Data handling: Shared directories, local disks
● Twister
– Programming model: Iterative Map Reduce
– Data handling: Local disks

Demo
● Java program
– Uses features of the Java 8 platform
● Lambda expressions
● Streams
– JDK references
● java.util.Collection.stream()
● java.lang.Iterable.forEach()
● java.util.List
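
The demo program itself is not part of this transcript. As a hedged stand-in, the sketch below touches each JDK item listed above (a List, Collection.stream(), Iterable.forEach() and lambda expressions) in a small map-then-reduce computation; the class name StreamDemo and the sum-of-squares task are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

class StreamDemo {

    static int sumOfSquares(List<Integer> numbers) {
        return numbers.stream()             // java.util.Collection.stream()
                .map(n -> n * n)            // lambda expression as the "map" step
                .reduce(0, Integer::sum);   // fold the mapped values as the "reduce" step
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4);   // java.util.List
        // java.lang.Iterable.forEach() with a lambda expression
        numbers.forEach(n -> System.out.println(n + " squared is " + n * n));
        System.out.println("sum of squares = " + sumOfSquares(numbers)); // sum of squares = 30
    }
}
```

Replacing stream() with parallelStream() would let the JDK split the work across cores, which is the closest single-machine analogue to the parallel Map and Reduce phases.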

References
● Jeffrey Dean et al.
MapReduce: Simplified Data Processing on Large Clusters
http://research.google.com/archive/mapreduce.html
● Michael Stonebraker et al.
MapReduce and Parallel DBMSs: Friends or Foes?
http://dl.acm.org/citation.cfm?id=1629197
● Java Lambda expressions
https://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexp
● PostgreSQL GROUP BY and ORDER BY
http://www.postgresql.org/docs/devel/static/sqlselect.html

Thank you.
● Questions
● Clarifications
● Suggestions
● Feedback
Ragini Jain
15030142023@sicsr.ac.in


Stack Memory Organization of 8086 Microprocessor

Data Structure using C by Dr. K Adisesha .ppsx

Elevate Your Nonprofit's Online Presence_ A Guide to Effective SEO Strategies...

Map Reduce

2. Overview
● What is Map Reduce
● Map Reduce schematic
● Map Reduce in detail
● Comparison of Map Reduce models
● Demo
● References

3. What is Map Reduce
● A software framework that supports parallel, distributed computing on large data sets.
● The framework abstracts the data flow of running a parallel program on a distributed computing system by providing users with two interfaces in the form of functions:
– Map
– Reduce
● Users control and manipulate the data flow of their programs by overriding the Map() and Reduce() functions.
● The Map Reduce library is the controller.

6. Map Reduce (in detail)
● The Map function is applied in parallel to every input (key, value) pair and produces a new set of intermediate (key, value) pairs:
(key1, val1) ---(map function)---> List(key2, val2)
● The MapReduce library then collects the intermediate (key, value) pairs produced from all input pairs, sorts them by key, and groups the values that share a key.
● Finally, the Reduce function is applied in parallel to each group, producing a collection of values:
(key2, List(val2)) ---(reduce function)---> List(val2)
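The three steps above can be sketched on a single machine with Java 8 streams. This is only an illustration under stated assumptions, not the deck's demo program: the class `WordCountSketch`, its methods `map`, `reduce`, and `run`, and the word-count task are all hypothetical, and a parallel stream stands in for a cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical single-JVM sketch of map -> sort/group -> reduce (word count).
public class WordCountSketch {

    // Map step: one input line -> intermediate (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .<Map.Entry<String, Integer>>map(word -> new SimpleEntry<>(word, 1));
    }

    // Reduce step: (word, list of counts) -> total count for that word.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    static Map<String, Integer> run(List<String> lines) {
        // Collect all intermediate pairs and group them by key;
        // the TreeMap keeps the keys sorted.
        Map<String, List<Integer>> grouped = lines.parallelStream()
                .flatMap(WordCountSketch::map)
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Apply the reduce step to each group.
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog", "the fox");
        run(lines).forEach((word, count) -> System.out.println(word + " = " + count));
    }
}
```

In a real framework the map and reduce calls would run on different machines and the grouping would involve moving data across the network; here all three steps share one process.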

7. Map Reduce (as a query framework)
● SQL clauses that are the building blocks for Map Reduce operations on structured data and data warehouses:
– GROUP BY
– ORDER BY
● Example on a very large set of demographic data:
SELECT age, AVG(contacts)
FROM social.person
GROUP BY age
ORDER BY age
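As a rough illustration of how the query above maps onto grouping and reducing, the same aggregation can be written with Java 8 collectors. The `Person` class is a hypothetical stand-in for a row of `social.person`, and the class and method names are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: SELECT age, AVG(contacts) ... GROUP BY age ORDER BY age
public class AverageContactsByAge {

    // Assumed shape of one row of social.person (only the columns the query uses).
    static final class Person {
        private final int age;
        private final int contacts;

        Person(int age, int contacts) {
            this.age = age;
            this.contacts = contacts;
        }

        int getAge() { return age; }
        int getContacts() { return contacts; }
    }

    static Map<Integer, Double> averageContactsByAge(List<Person> people) {
        return people.parallelStream()
                .collect(Collectors.groupingBy(
                        Person::getAge,                                  // GROUP BY age
                        TreeMap::new,                                    // ORDER BY age
                        Collectors.averagingInt(Person::getContacts))); // AVG(contacts)
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person(30, 5), new Person(25, 10), new Person(25, 20));
        averageContactsByAge(people)
                .forEach((age, avg) -> System.out.println(age + " -> " + avg));
    }
}
```

The classifier plays the role of the intermediate key, and the downstream collector plays the role of the reduce function applied to each group.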

9. Comparison of Map Reduce models
● Google Map Reduce
– Programming model: Map Reduce
– Data handling: Google File System
● Apache Hadoop
– Programming model: Map Reduce
– Data handling: HDFS (Hadoop Distributed File System)
● Microsoft Dryad
– Programming model: DAG (Directed Acyclic Graph) execution
– Data handling: Shared directories, local disks
● Twister
– Programming model: Iterative Map Reduce
– Data handling: Local disks

10. Demo
● Java program
– Uses features of the Java 8 language and platform:
  ● Lambda expressions
  ● Streams
– JDK references:
  ● java.util.Collection.stream()
  ● java.lang.Iterable.forEach()
  ● java.util.List
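The demo program itself is not reproduced in this transcript. As a minimal sketch of the JDK pieces listed above, the following hypothetical class (`StreamDemoSketch` and `sumOfSquares` are assumed names, and the sum-of-squares task is chosen only for illustration) uses `Collection.stream()`, `Iterable.forEach()`, and lambda expressions on a `java.util.List`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the listed JDK features; not the original demo code.
public class StreamDemoSketch {

    // Map each number to its square, then reduce the squares to one sum.
    static int sumOfSquares(List<Integer> numbers) {
        return numbers.stream()            // java.util.Collection.stream()
                .map(n -> n * n)           // lambda expression as the map function
                .reduce(0, Integer::sum);  // reduce with an identity and an accumulator
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4);

        // java.lang.Iterable.forEach() with a lambda expression.
        numbers.forEach(n -> System.out.println("input: " + n));

        System.out.println("sum of squares: " + sumOfSquares(numbers));
    }
}
```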

11. References
● Jeffrey Dean et al., MapReduce: Simplified Data Processing on Large Clusters
http://research.google.com/archive/mapreduce.html
● Michael Stonebraker et al., MapReduce and Parallel DBMSs: Friends or Foes?
http://dl.acm.org/citation.cfm?id=1629197
● Java Lambda expressions
https://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexp
● PostgreSQL GROUP BY and ORDER BY
http://www.postgresql.org/docs/devel/static/sqlselect.html

12. Thank you.
● Questions
● Clarifications
● Suggestions
● Feedback
Ragini Jain
15030142023@sicsr.ac.in
