The document surveys several shortcomings of the MapReduce paradigm, as outlined by its critics. It notes that while Hadoop has become a popular solution for large-scale data processing, MapReduce has limitations: it exposes a low-level programming interface, supports only batch processing, handles uneven data distribution (skew) poorly, and lacks support for iterative/recursive applications and incremental computation. Critics argue it is a step backwards from database technologies and is missing functionality such as updates, transactions, and tools for data visualization.
The document provides an abstract for a paper on the Hadoop framework. Hadoop is a software framework that supports data-intensive distributed applications under an open-source license; it was inspired by Google's MapReduce and Google File System papers. The paper presents the history, development, and current state of Hadoop technology, which is now maintained by the Apache Software Foundation, with commercial distributions available from vendors such as Cloudera. It includes chapters on an introduction to Hadoop, its history, key technologies such as MapReduce and HDFS, other related Apache projects, and instructions for setting up a single-node Hadoop cluster.
Apache Hadoop is a software framework that supports distributed applications under a free license. It lets applications work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.
Hadoop is a top-level Apache project being built and used by a global community of contributors, written in the Java programming language. Yahoo! has been the largest contributor to the project and uses Hadoop extensively in its business.
Understanding Big Data summarizes big data and popular big data technologies. It discusses how big data is generated from various sources and is too large to be processed by traditional databases. Popular technologies like Hadoop, HDFS, MapReduce, Hive, Pig, HBase, and Mahout are able to collect, store, process, and analyze big data. Companies are using big data to gain insights from customer data, optimize operations, prevent fraud, and make recommendations.
This document provides an overview of a Hadoop session that will cover:
1. An introduction to big data including the history and evolution of Hadoop and how it addresses challenges with traditional databases.
2. The Hadoop architecture and ecosystem including components like HDFS, MapReduce, HBase and how they address issues with scalability, flexibility and cost compared to traditional databases.
3. Hands-on analysis of a soccer dataset using Hadoop to perform tasks like data classification, prediction and player analysis.
re:Introduce Big Data and Hadoop Eco-system (Shakir Ali)
This document provides an overview of big data and the Hadoop ecosystem. It defines big data as large and complex datasets that are difficult to process using traditional data management tools. Characteristics of big data include volume, variety, velocity and veracity. The document discusses challenges of managing big data and how Hadoop provides solutions through its distributed architecture. It also summarizes some prominent Apache projects in the Hadoop ecosystem like Pig, Hive, Spark and Hbase.
Is It A Right Time For Me To Learn Hadoop? Find Out (Edureka!)
Forrester predicts that CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has evolved into a must-know technology and has led to better careers, salaries, and job opportunities for many professionals.
LinkedIn is a large professional social network with 50 million users from around the world. It faces big data challenges at scale, such as caching a user's third-degree network of up to 20 million connections and performing searches across 50 million user profiles. LinkedIn uses Hadoop and other scalable architectures, such as distributed search engines and custom graph engines, to solve these problems. Hadoop provides a scalable framework to process massive amounts of user data across thousands of nodes through its MapReduce programming model and HDFS distributed file system.
This document discusses big data tools and management at large scales. It introduces Hadoop, an open-source software framework for distributed storage and processing of large datasets using MapReduce. Hadoop allows parallel processing of data across thousands of nodes and has been adopted by large companies like Yahoo!, Facebook, and Baidu to manage petabytes of data and perform tasks like sorting terabytes of data in hours.
BreizhJUG - January 2014 - Big Data - Dataiku - Pages Jaunes (Dataiku)
This document provides an overview of big data and various big data tools including Pig, Hive, and Cascading. It discusses the history and motivation for each tool, how they work by mapping operations to MapReduce jobs, and compares key aspects of their data models, typing, and procedural vs declarative styles. The document is intended as a training presentation on these popular big data frameworks.
The document discusses big data and its applications. It defines big data as large and complex data sets that are difficult to process using traditional data management tools. It outlines the three V's of big data - volume, variety, and velocity. Various types of structured, semi-structured, and unstructured data are described. Examples are given of how big data is used in various industries like automotive, finance, manufacturing, policing, and utilities to improve products, detect fraud, perform simulations, track suspects, and monitor assets. Popular big data software like Hadoop and MongoDB are also mentioned.
The document discusses big data, including what it is, sources of big data like social media and stock exchange data, and the three Vs of big data - volume, velocity, and variety. It then discusses Hadoop, the open-source framework for distributed storage and processing of large datasets across clusters of computers. Key components of Hadoop include HDFS for distributed storage, MapReduce for distributed computation, and YARN which manages computing resources. The document also provides overviews of Pig and Jaql, programming languages used for analyzing data in Hadoop.
How to build and run a big data platform in the 21st century (Ali Dasdan)
The document provides an overview of big data platform architectures that have been built by various companies and organizations. It discusses self-built platforms from companies like Airbnb, Netflix, Facebook, Slack, and Uber. It also covers cloud-built platforms on IBM Cloud, Microsoft Azure, Google Cloud, and Amazon AWS. Consulting-built platforms from Cloudera and ThoughtWorks are presented. Finally, it introduces the NIST Big Data Reference Architecture as a standard reference model and discusses generic batch vs streaming architectures like Lambda and Kappa.
This document presents information on Big Data and its association with Hadoop. It discusses what Big Data is, defining it as data too large and complex for traditional databases. It also covers the 3 V's of Big Data: volume, variety, and velocity. The document then introduces Hadoop as a tool for Big Data analytics, describing what Hadoop is and its key components and features, such as being scalable, reliable, and economical. MapReduce is discussed as Hadoop's programming model using mappers and reducers. Finally, the document concludes that Hadoop enables distributed, parallel processing of large data across inexpensive servers.
The document is a seminar report on the Hadoop framework. It provides an introduction to Hadoop and describes its key technologies including MapReduce, HDFS, and programming model. MapReduce allows distributed processing of large datasets across clusters. HDFS is the distributed file system used by Hadoop to reliably store large amounts of data across commodity hardware.
The document discusses big data analytics using Hadoop. It provides an introduction to big data and the 5 V's of big data. It then discusses limitations of relational database management systems for big data. The document outlines the history and development of Hadoop. It describes the components, architecture and advantages of Hadoop. Some applications of Hadoop for big data analytics are also highlighted along with disadvantages. The conclusion reiterates that Hadoop is an open source tool for handling big data analytics.
Topic 5: MapReduce Theory and Implementation (Zubair Nabi)
The document discusses MapReduce theory and implementation. It describes how MapReduce was designed by Google engineers to abstract complex distributed computations by allowing programmers to specify map and reduce functions. The core of MapReduce is applying map tasks in parallel on data blocks and collecting the outputs by key before applying reduce tasks. Implementations involve a master coordinating work across a cluster of shared-nothing commodity machines. Common examples like word count are provided to illustrate the programming model.
This document discusses technologies used for big data. It describes big data as massive volumes of structured and unstructured data that is difficult to process using traditional databases. It then lists and describes 10 technologies used for big data, including column-oriented databases, schema-less databases, MapReduce, Hadoop, Hive, Pig, WibiData, Platfora, storage technologies, and SkyTree. These technologies allow for processing and analyzing large datasets.
Today's era is often called the era of data: every field of computing generates huge amounts of data. Society is increasingly dependent on computers, so vast quantities of data are produced every second in structured, unstructured, or semi-structured formats. Such data is generally referred to as big data, and analyzing it is one of the biggest challenges in the world today. Hadoop is an open-source framework for storing and processing big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale out from a single server to thousands of machines, each offering local computation and storage, and it generally follows horizontal processing. MapReduce programs typically run on the Hadoop framework to process large volumes of structured and unstructured data. This paper describes the different join strategies used in MapReduce programming to combine data from two files in the Hadoop framework and discusses the skew problem associated with them.
This document discusses MapReduce application scripting. It provides an overview of Pig Latin and Cascading, two frameworks for writing MapReduce applications in a declarative way. Pig Latin expresses data flows as a sequence of steps and allows custom user-defined functions. Cascading allows creating MapReduce pipelines in JVM languages using a source-pipe-sink paradigm. The document defines key terminology and provides examples of MapReduce jobs written in Pig Latin.
1. The document discusses the evolution of computing from mainframes to smaller commodity servers and PCs. It then introduces cloud computing as an emerging technology that is changing the technology landscape, with examples like Google File System and Amazon S3.
2. It discusses the need for large data processing due to increasing amounts of data from sources like the stock exchange, Facebook, genealogy sites, and scientific experiments.
3. Hadoop is introduced as a framework for distributed computing and reliable shared storage and analysis of large datasets using its Hadoop Distributed File System (HDFS) for storage and MapReduce for analysis.
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm (IRJET Journal)
This document proposes using a ranking algorithm and sampling algorithm to improve the performance of a heterogeneous Hadoop cluster. The ranking algorithm prioritizes data distribution based on node frequency, so that higher frequency nodes are processed first. The sampling algorithm randomly selects nodes for data distribution instead of evenly distributing across all nodes. The proposed approach reduces computation time and improves overall cluster performance compared to the existing approach of evenly distributing data across nodes of varying sizes. Results show the proposed approach reduces execution time for various file sizes compared to the existing approach.
The document discusses Big Data, MapReduce, Hadoop, and Pydoop. It provides an overview of MapReduce and how it works, describing the map and reduce functions. It also describes Hadoop, the popular open-source implementation of MapReduce, including its architecture and core components like HDFS and how tasks are executed in a distributed manner. Finally, it briefly introduces Pydoop as a way to use Python with Hadoop.
There is a growing trend of applications that have to handle huge volumes of information, yet analyzing such information remains a very difficult problem today. Several techniques can be considered for such data: technologies like grid computing, volunteer computing, and RDBMSs are potential candidates, and the still-maturing Hadoop tool is another option. We survey all of these techniques to find a suitable approach for managing and working with Big Data.
This document discusses efficient analysis of big data using the MapReduce framework. It introduces the challenges of analyzing large and complex datasets, and describes how MapReduce addresses these challenges through its map and reduce functions. MapReduce allows distributed processing of big data across clusters of computers using a simple programming model.
An overview of big data and Hadoop: the architecture Hadoop uses and the way it works on data sets. The slides also show the various fields where these technologies are most often used and implemented.
We are in the age of big data, which involves the collection of large datasets. Managing and processing large datasets is difficult with existing traditional database systems. Hadoop and MapReduce have become among the most powerful and popular tools for big data processing. Hadoop MapReduce, a powerful programming model, is used for analyzing large datasets with parallelization, fault tolerance, and load balancing; it is also elastic, scalable, and efficient. MapReduce is combined with the cloud to form a framework for the storage, processing, and analysis of massive machine-maintenance data in a cloud computing environment.
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
This document provides an overview and agenda for a presentation on how Google handles big data. The presentation covers Google Cloud Platform and how it can be used to run Hadoop clusters on Google Compute Engine and leverage BigQuery for analytics. It also discusses how Google processes big data internally using technologies like MapReduce, BigTable and Dremel and how these concepts apply to customer use cases.
This document discusses topics related to NoSQL data management and distribution models in big data analytics. It covers key-value and document data models, as well as graph databases and schema-less databases. It then describes several distribution models including single server, sharding, master-slave replication, peer-to-peer replication, and combining sharding and replication. Specific examples of these models in MongoDB and Cassandra are provided. The next session will cover Cassandra's data model.
Mankind has stored more than 295 billion gigabytes (295 exabytes) of data since 1986, according to a report by the University of Southern California. Storing and monitoring this data around the clock in widely distributed environments is a huge task for global service organizations. These datasets require high processing power that traditional databases cannot offer, as the data is stored in an unstructured format. Although the MapReduce paradigm, via Java-based Hadoop, can be used to attack this problem, it does not provide maximum functionality on its own. These drawbacks can be overcome using Hadoop Streaming techniques, which allow users to supply non-Java executables for processing these datasets. This paper proposes a THESAURUS model that enables faster and easier business analysis.
This talk was for a GDG Fresno meeting. The demo used Google Compute Engine and Google Cloud Storage. The actual talk differed from the slides; there were a lot of good questions from the audience, and it diverted to side topics many times.
This document discusses network communication in Unix systems. It describes how the networking infrastructure abstracts different network architectures and consists of network protocols, address families, and additional facilities. It also summarizes the network subsystem layers, memory management using mbufs, data flow between sockets and the network, common network protocols, network interfaces, routing, and protocol control blocks.
This document discusses the background and advantages of virtualization. It describes how IBM originally solved the problem of running multiple operating systems on the same machine by adding a virtual memory monitor or hypervisor. The hypervisor sits between operating systems and hardware, giving each OS the illusion of full hardware control while actually multiplexing hardware access. This allows server consolidation by running multiple OSes on fewer physical servers. The document then discusses challenges of virtualizing privileged operations, system calls, and virtual memory that require interception and emulation by the hypervisor.
AOS Lab 10: File system -- Inodes and beyond (Zubair Nabi)
This document provides a summary of file system concepts in the xv6 operating system, including:
1) Inodes are data structures that represent files and provide metadata and pointers to file data blocks. On-disk inodes are read into memory inodes when files are accessed.
2) Directories are represented by special directory inodes containing directory entries with names and pointers to other inodes.
3) The file system layout divides the disk into sections for the boot sector, superblock, inodes, bitmap, data blocks, and log for atomic transactions.
AOS Lab 9: File system -- Of buffers, logs, and blocks (Zubair Nabi)
The document describes the file system layers in xv6, including the buffer cache, logging, and on-disk layout. The buffer cache synchronizes access to disk blocks and caches popular blocks in memory. The logging layer ensures atomicity by wrapping file system updates in transactions written to a log on disk before writing to the file system structures. The on-disk layout divides the disk into sections for the boot sector, superblock, inodes, bitmap, data blocks, and log blocks.
AOS Lab 8: Interrupts and Device Drivers (Zubair Nabi)
This document discusses interrupts, device drivers, and the xv6 operating system. It provides recaps of previous labs on extraordinary events like interrupts, exceptions, and system calls. It explains how interrupts are handled on multi-processor systems using the I/O APIC to route interrupts and the LAPIC as a per-CPU interrupt controller. An example is given of how timer interrupts are used to track time and scheduling. Device drivers are introduced as code that manages devices by providing interrupt handlers and controlling device operations. The disk driver is given as an example to copy data between disk and memory in 512-byte sectors.
Page tables allow the OS to multiplex process address spaces onto physical memory, protect memory between processes, and map kernel memory in user address spaces. Page tables are stored as a two-level tree structure with a page directory and page table pages. Virtual addresses are translated to physical addresses by indexing the page directory and table to obtain the physical page number in the page table entry.
The document discusses process scheduling in an operating system. It describes how an OS runs more processes than it has processors by providing each process with a virtual processor and multiplexing these across physical processors. When a process performs I/O or its time quantum expires, the scheduler selects another process to run using a timer interrupt. Context switching involves saving the context of the current process and restoring the next process using the swtch function. The scheduler runs in a loop, acquiring the process table lock to select a RUNNABLE process and releasing it to allow other CPUs access between iterations.
The document discusses system calls and how they are handled in operating systems. It explains that system calls allow user processes to request services from the kernel by generating an interrupt that switches the processor into kernel mode. On x86 processors, the interrupt handler saves process state and routes the call to the appropriate kernel code based on an interrupt descriptor table with 256 entries. The document provides details on how Linux/x86 implements system calls, exceptions, and interrupts using the IDT, and switches between user and kernel mode to maintain isolation.
AOS Lab 4: If you liked it, then you should have put a “lock” on it (Zubair Nabi)
The document discusses concurrency issues that arise in operating systems and how xv6 handles them using locks. It begins by explaining how multiple CPUs can interfere with each other when sharing kernel data structures, and notes that even on single-CPU systems, interrupt handlers can interfere with non-interrupt code; xv6 uses locks to address both situations. The document then provides examples of race conditions that can occur without locks, such as when multiple processors concurrently add to a shared linked list, and shows how xv6 implements locks and uses them to make operations like inserting into a linked list atomic. It also discusses challenges like lock ordering, handling locks in interrupt handlers, and when to use coarse-grained locking.
The document describes how the first process is started on a PC. When a PC boots, the BIOS starts executing and loads the boot loader from the boot sector of the boot disk. The boot loader then loads the kernel into memory and jumps to it. The kernel initializes devices and creates the first process by setting up its page table and memory space. The first process's state is set to runnable, and the scheduler runs it, switching to its address space. The first process makes a system call to load the /init program, which sets up the console and runs the shell as the main process.
1) xv6 is a reimplementation of the Unix Version 6 operating system (V6) in ANSI C. It is used at MIT for teaching operating systems concepts.
2) The document discusses installing xv6 on a system by cloning its source code from GitHub and compiling it. Key steps include installing dependencies, QEMU, and cloning the xv6 source code.
3) An overview of xv6's structure is provided, noting it is a monolithic kernel that provides services to user processes via system calls, allowing processes to alternate between user and kernel space.
This document provides an introduction to Linux and common Linux commands. It discusses key facts about Unix, how Linux is based on Unix, popular Linux distributions like Ubuntu, and common file system layout and commands for manipulating files and directories. The document concludes with an assignment to write a Bash script to analyze and compare British and American English dictionaries.
The document summarizes the key components of the big data stack, from the presentation layer where users interact, through various processing and storage layers, down to the physical infrastructure of data centers. It provides examples like Facebook's petabyte-scale data warehouse and Google's globally distributed database Spanner. The stack aims to enable the processing and analysis of massive datasets across clusters of servers and data centers.
Raabta: Low-cost Video Conferencing for the Developing World (Zubair Nabi)
This document proposes Raabta, a low-cost video conferencing system for developing regions. Raabta leverages existing analog cable TV networks and uses inexpensive Raspberry Pi devices as endpoints. It was designed with principles of low cost, low power usage, tolerance of failure-prone environments, and a simple interface. The system avoids reliance on internet connectivity by using the cable networks for both upstream and downstream video streams encoded for robust transmission. This approach could enable affordable, widespread communication tools for communities with limited infrastructure and resources.
The Anatomy of Web Censorship in Pakistan (Zubair Nabi)
This document summarizes a study on internet censorship in Pakistan. It found that censorship mechanisms in Pakistan were upgraded in mid-2013 from ISP-level blocking to centralized blocking at the internet exchange point (IXP) level. Most websites were blocked through DNS redirection, while some used HTTP redirection. After the upgrade, blocking was done through 200 response packets injected at the IXP level. Public VPNs and web proxies were popular ways for citizens to circumvent restrictions.
This document discusses Hive, an open source data warehousing system built on top of Hadoop. Hive allows users to query data stored in Hadoop using a SQL-like language called HiveQL. Queries are compiled into MapReduce jobs for execution. The document describes Hive's data model, data types, HiveQL language, and metastore. It provides an example of using Hive to analyze Facebook status updates.
Topic 15: Datacenter Design and Networking (Zubair Nabi)
The document discusses datacenter network design and transport protocols. It begins with an introduction to traditional datacenter network topologies, which use a 2-3 level tree structure. It then covers fat-tree and DCell topologies as alternatives. The document also discusses how TCP, while commonly used, is not optimal for datacenter networks due to design assumptions like round-trip time that differ from wide-area networks. It suggests transport protocols designed for datacenter characteristics could improve performance.
Topic 14: Operating Systems and Virtualization (Zubair Nabi)
The document discusses operating systems and virtualization. It provides an overview of several Linux distributions including their key features and use cases. It also describes Xen, a hypervisor used to run multiple virtual machines on a single physical machine. Xen uses a dom0 domain to control hardware access and export virtual devices to domU guest virtual machines. I/O is handled through backend and frontend device drivers in the dom0 and domUs respectively.
The document discusses different cloud computing stacks, including CloudStack and OpenStack. It provides details on the components and features of each stack. CloudStack is presented as a console for managing data center resources like virtual machines, networking, and storage. It enables IaaS capabilities. OpenStack is described as an open source software for building public and private clouds, with components that manage compute, storage, networking, identity, and dashboards. It supports multiple hypervisors and is used by many large companies.
Lab 5: Interconnecting a Datacenter using Mininet (Zubair Nabi)
This document discusses using Mininet, an emulator for real-world networks that uses real kernel, switch, and application code on a single machine. It describes how Mininet uses Linux containers to emulate hosts, switches, and links. It also explains that Mininet creates a container and network namespace for each virtual host, with virtual interfaces connecting hosts to software switches via veth links. Finally, it briefly outlines Mininet's command line and Python interfaces.
Topic 7: Shortcomings in the MapReduce Paradigm
1. 7: Shortcomings in the MapReduce Paradigm
Zubair Nabi
zubair.nabi@itu.edu.pk
April 19, 2013
2. Outline
1 Hadoop everywhere!
2 Skew
3 Heterogeneous Environment
4 Low-level Programming Interface
5 Strictly Batch-processing
6 Single-input/single-output and Two-phase
7 Iterative and Recursive Applications
8 Incremental Computation
9. Users [1]
Adobe: Several areas, from social services to unstructured data storage and processing
eBay: A 532-node cluster storing 5.3PB of data
Facebook: Used for reporting/analytics; one cluster with 1,100 nodes (12PB) and another with 300 nodes (3PB)
LinkedIn: 3 clusters with 4,000 nodes collectively
Twitter: To store and process Tweets and log files
Yahoo!: Multiple clusters with 40,000 nodes collectively; the largest cluster has 4,500 nodes!
[1] http://wiki.apache.org/hadoop/PoweredBy
16. But all is not well
Over the years, Hadoop has become a one-size-fits-all solution to data
intensive computing
As early as 2008, David DeWitt and Michael Stonebraker asserted that
MapReduce was a “major step backwards” for data intensive
computing
They opined:
MapReduce is a major step backwards in database access because it
negates schema and is too low-level
It has a sub-optimal implementation as it, makes use of brute force
instead of indexing, does not handle skew, and uses data pull instead of
push
It is just rehashing old database concepts
It is missing most DBMS functionalities, such as updates, transactions,
etc.
It is incompatible with DBMS tools, such as human visualization, data
replication from one DBMS to another, etc.
Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 5 / 31
17. Outline
1 Hadoop everywhere!
2 Skew
3 Heterogeneous Environment
4 Low-level Programming Interface
5 Strictly Batch-processing
6 Single-input/single output and Two-phase
7 Iterative and Recursive Applications
8 Incremental Computation
20. Introduction
Due to the uneven distribution of intermediate key/value pairs, some reduce workers end up doing more work
Such reducers become “stragglers”
A large number of real-world applications follow long-tailed (Zipf-like) distributions
23. Wordcount and skew
Text corpora have a Zipfian skew, i.e. a very small number of words account for most occurrences
For instance, of the 242,758 words in the dataset used to generate the original figure, the 10, 100, and 1000 most frequent words account for 22%, 43%, and 64% of the entire set
Such skewed intermediate results lead to an uneven distribution of workload across reduce workers
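To make the effect concrete, here is a minimal sketch (not from the original deck) that samples words from a Zipf distribution and hash-partitions them across reducers, the way Hadoop's default HashPartitioner does. The exponent 1.1 and the reducer count are illustrative assumptions, not values from the slide's dataset:

```python
# Minimal sketch (not from the deck): Zipfian word frequencies turn
# into uneven reducer load under hash partitioning.
import numpy as np

rng = np.random.default_rng(42)
NUM_PAIRS, NUM_REDUCERS = 50_000, 10

# Draw word ids from a Zipf distribution; exponent 1.1 is an assumed
# value in the range commonly reported for text corpora.
word_ids = rng.zipf(1.1, size=NUM_PAIRS)

# HashPartitioner-style assignment: reducer = hash(key) % R.
load = [0] * NUM_REDUCERS
for w in word_ids:
    load[hash(int(w)) % NUM_REDUCERS] += 1

print("key/value pairs per reducer:", load)
print("max/mean imbalance: %.1fx" % (max(load) * NUM_REDUCERS / NUM_PAIRS))
```

The few hottest keys all land on a handful of reducers, so those reducers finish long after the rest: exactly the straggler effect the slide describes.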
30. Page rank and skew
Even Google’s implementation of its core PageRank algorithm is plagued by the skew problem
Google uses PageRank to calculate a webpage’s relevance for a given search query
Map: Emit the outlinks for each page
Reduce: Calculate the rank per page
The skew in intermediate data exists due to the huge disparity in the number of incoming links across pages on the Internet
The scale of the problem is evident when we consider that Google currently indexes more than 25 billion webpages with skewed links
For instance, Facebook has 49,376,609 incoming links (at the time of writing) while the presenter’s personal webpage has only 4
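For reference, a minimal single-machine sketch of the map/reduce pair described above. This is illustrative only: it runs one PageRank iteration in memory, ignores dangling pages, and assumes a damping factor of 0.85; the three-page graph is made up:

```python
# Minimal single-machine sketch of one PageRank iteration as a
# map/reduce pair (illustrative only; ignores dangling pages).
from collections import defaultdict

DAMPING = 0.85

def map_page(page, rank, outlinks):
    # Emit the page's link structure plus its rank contributions.
    yield page, ("links", outlinks)
    for target in outlinks:
        yield target, ("contrib", rank / len(outlinks))

def reduce_page(page, values, num_pages):
    # Sum the incoming contributions to compute the new rank.
    contribs = sum(v for tag, v in values if tag == "contrib")
    return (1 - DAMPING) / num_pages + DAMPING * contribs

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {p: 1 / len(graph) for p in graph}

grouped = defaultdict(list)                  # stands in for the shuffle
for page, rank in ranks.items():
    for key, value in map_page(page, rank, graph[page]):
        grouped[key].append(value)

ranks = {p: reduce_page(p, vals, len(graph)) for p, vals in grouped.items()}
print(ranks)
```

A reducer for a page like Facebook would receive tens of millions of contribution values, while a reducer for an obscure page receives a handful: the intermediate key/value distribution mirrors the in-link skew.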
33. Zipf distributions are everywhere
Zipf distributions are also followed by inverted indexing, publish/subscribe systems, fraud detection, and various clustering algorithms
P2P systems exhibit Zipf distributions too, both in terms of users and content
The same holds for web caching schemes as well as email and social networks
34. Outline (next section: 3 Heterogeneous Environment)
38. Introduction
In the MapReduce model, tasks which take exceptionally long are labelled “stragglers”
The framework launches a speculative copy of each straggler on another machine, expecting it to finish more quickly
Without this, the overall job completion time is dictated by the slowest straggler
On Google clusters, speculative execution can reduce job completion time by 44%
44. Hadoop’s assumptions regarding speculation
1 All nodes are equal, i.e. they can perform work at more or less the same rate
2 Tasks make progress at a constant rate throughout their lifetime
3 There is no cost to launching a speculative copy on an otherwise idle slot/node
4 The progress score of a task captures the fraction of its total work that it has done. Specifically, the shuffle, merge, and reduce logic phases each take roughly 1/3 of the total time
5 As tasks finish in waves, a task with a low progress score is most likely a straggler
6 Tasks within the same phase require roughly the same amount of work
47. Assumptions 1 and 2
1 All nodes are equal, i.e. they can perform work at more or less the same rate
2 Tasks make progress at a constant rate throughout their lifetime
Both break down in heterogeneous environments, which consist of multiple generations of hardware
49. Assumption 3
3 There is no cost to launching a speculative copy on an otherwise idle slot/node
Breaks down due to shared resources
51. Assumption 4
4 The progress score of a task captures the fraction of its total work that it has done. Specifically, the shuffle, merge, and reduce logic phases each take roughly 1/3 of the total time
Breaks down due to the fact that in reduce tasks the shuffle phase takes the longest time, as opposed to the other two phases
53. Assumption 5
5 As tasks finish in waves, a task with a low progress score is most likely a straggler
Breaks down because task completion is spread across time owing to uneven workload
55. Assumption 6
6 Tasks within the same phase require roughly the same amount of work
Breaks down due to data skew
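The LATE scheduler of Zaharia et al. (reference 2) addresses several of these broken assumptions by ranking candidate tasks on their estimated time left rather than their raw progress score. A minimal sketch of that estimate (the numbers are made up for illustration):

```python
# Sketch of the progress-rate idea from Zaharia et al. (OSDI '08):
# instead of speculating on the lowest *progress score* (Hadoop's
# heuristic), estimate each task's remaining time from its *rate*.
def time_left(progress_score, elapsed_seconds):
    rate = progress_score / elapsed_seconds   # work done per second so far
    return (1.0 - progress_score) / rate      # estimated seconds remaining

# Two tasks with the same score but very different prospects:
print(time_left(0.3, elapsed_seconds=30))    # recently started, fast: ~70s left
print(time_left(0.3, elapsed_seconds=300))   # genuine straggler: ~700s left
```

Under Hadoop's score-only heuristic these two tasks look identical; the rate-based estimate correctly flags only the second one for speculation.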
56. Outline (next section: 4 Low-level Programming Interface)
60. Introduction
The one-input, two-stage data flow is extremely rigid for ad-hoc analysis of large datasets
Hacks need to be put into place for different data flows, such as joins or multiple stages
Custom code has to be written for common DB operations, such as projection and filtering
The opaque nature of the map and reduce functions makes it impossible to perform optimizations, such as operator reordering
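As a rough illustration of the "custom code" point, the sketch below hand-codes a filter plus a projection as a Hadoop Streaming-style mapper; the tab-separated field layout and the date cutoff are assumptions invented for the example. In SQL, or in Pig Latin (reference 3), this would be a one-liner:

```python
#!/usr/bin/env python3
# Hadoop Streaming-style mapper hand-coding the equivalent of
#   SELECT user, url FROM visits WHERE time > '2013-01-01'
# (field positions and the cutoff date are assumed for illustration).
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        continue                          # skip malformed records
    user, url, time = fields[0], fields[1], fields[2]
    if time > "2013-01-01":               # filtering
        print(f"{user}\t{url}")           # projection
```

Because the filter and projection are buried inside an opaque user function, the framework cannot see them, let alone reorder them with other operators the way a query optimizer would.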
61. Outline (next section: 5 Strictly Batch-processing)
65. Introduction
In the case of MapReduce, the entire output of a map or reduce task needs to be materialized to local storage before the next stage can commence
This simplifies fault-tolerance
Reducers have to pull their input instead of the mappers pushing it
This negates pipelining, result estimation, and continuous queries (stream processing)
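A toy illustration of the resulting barrier (a simulation, not Hadoop code): no reduce output can be produced until the entire map output has been materialized and sorted, which is precisely what rules out streaming early results:

```python
# Toy simulation of the batch barrier: every map output must be fully
# materialized (here: collected into a list) before the first reduce
# call can run, so no reduce output can be emitted early.
from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer):
    intermediate = []                     # stands in for local disk
    for rec in records:
        intermediate.extend(mapper(rec))  # map phase runs to completion
    intermediate.sort(key=itemgetter(0))  # shuffle/sort barrier
    for key, group in groupby(intermediate, key=itemgetter(0)):
        yield reducer(key, [v for _, v in group])

counts = run_job(
    ["to be or not to be"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: (word, sum(ones)),
)
print(list(counts))   # nothing is available until the barrier clears
```

Systems like MapReduce Online (reference 4) relax exactly this barrier by having mappers push partial output to reducers as it is produced.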
66. Outline (next section: 6 Single-input/single output and Two-phase)
68. Introduction
1 Not all applications can be broken down into just two phases, such as complex SQL-like queries
2 Tasks take in just one input and produce one output (see the join sketch after this list)
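The standard workaround for two-input operations is the reduce-side join: tag every record with its source relation, push the union of both relations through a single job, and re-split them in the reducer. A minimal in-memory sketch, with made-up relations:

```python
# Sketch of the classic workaround for joins under a single-input
# model: tag each record with its source relation, feed the union of
# both relations through one job, and re-split them in the reducer.
from collections import defaultdict

users  = [(1, "alice"), (2, "bob")]             # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "mug")]  # (user_id, item)

def mapper():
    for uid, name in users:
        yield uid, ("U", name)                  # tag with relation
    for uid, item in orders:
        yield uid, ("O", item)

grouped = defaultdict(list)                     # stands in for the shuffle
for key, value in mapper():
    grouped[key].append(value)

for uid, values in grouped.items():             # reduce-side join
    names = [v for tag, v in values if tag == "U"]
    items = [v for tag, v in values if tag == "O"]
    for name in names:
        for item in items:
            print(uid, name, item)
```

Dataflow systems such as Dryad and CIEL (references 5 and 6) remove the need for this tagging trick by supporting arbitrary multi-input, multi-stage graphs natively.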
69. Outline (next section: 7 Iterative and Recursive Applications)
74. Introduction
1 Hadoop is widely employed for iterative computations
2 For machine learning applications, the Apache Mahout library is used atop Hadoop
3 Mahout uses an external driver program to submit multiple jobs to Hadoop and perform a convergence test (see the sketch after this list)
4 The driver approach provides no fault-tolerance and incurs job-submission overhead on every iteration
5 Loop-invariant data is materialized to storage in each iteration
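A sketch of that external-driver pattern; the "job" here is stubbed with a simple damped-averaging step so the loop is actually runnable, whereas a real driver such as Mahout's submits a full Hadoop job per pass:

```python
# Sketch of the external-driver pattern: one full job submission per
# iteration, with convergence tested outside the framework. The job
# body is a stand-in; only the driver structure matters.

def run_job(state):
    # Stand-in for a full MapReduce submission: one damped averaging
    # step, chosen so the loop converges.
    mean = sum(state) / len(state)
    return [0.5 * x + 0.5 * mean for x in state]

def converged(old, new, eps=1e-6):
    return max(abs(a - b) for a, b in zip(old, new)) < eps

state = [1.0, 4.0, 7.0]
for i in range(100):                 # each pass = a separate Hadoop job:
    new_state = run_job(state)       # job-submission overhead every time,
    if converged(state, new_state):  # driver-side convergence test,
        break                        # no fault-tolerance across iterations
    state = new_state
print(f"converged after {i + 1} iterations: {state}")
```

Note that in the real setting the entire input, including the part that never changes between iterations (e.g. the link graph in PageRank), is re-read from storage on every pass.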
75. Outline (next section: 8 Incremental Computation)
78. Introduction
1 Most workloads processed by MapReduce are incremental in nature, i.e. MapReduce jobs often run repeatedly with small changes in their input
2 For instance, most iterations of PageRank run with very small modifications
3 Unfortunately, even with a small change in input, MapReduce re-performs the entire computation from scratch (see the sketch after this list)
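Systems such as Incoop (reference 7) attack this by memoizing task results at the granularity of input chunks, so a re-run with a small input change only recomputes the changed chunks. A minimal sketch of the idea, using a content hash of each chunk as the cache key:

```python
# Sketch of the memoization idea behind systems like Incoop (ref. 7):
# cache per-chunk map results keyed by a content hash, so a re-run
# with a small input change only recomputes the changed chunks.
import hashlib
from collections import Counter

cache = {}

def map_chunk(chunk):
    key = hashlib.sha256(chunk.encode()).hexdigest()
    if key not in cache:                        # only new or changed
        cache[key] = Counter(chunk.split())     # chunks are re-mapped
    return cache[key]

def word_count(chunks):
    total = Counter()
    for c in chunks:
        total.update(map_chunk(c))
    return total

v1 = ["a rose is a rose", "is a rose"]
print(word_count(v1))
v2 = ["a rose is a rose", "is a daisy"]   # only one chunk changed
print(word_count(v2))                     # only the second chunk re-runs
```

Plain MapReduce has no such cache: the second run would re-map both chunks even though only one of them changed.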
79. References
1 MapReduce: A major step backwards. http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
2 Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI ’08). USENIX Association, Berkeley, CA, USA, 29-42.
3 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD ’08). ACM, New York, NY, USA, 1099-1110.
80. References (2)
4 Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce Online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI ’10). USENIX Association, Berkeley, CA, USA.
5 Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys ’07). ACM, New York, NY, USA, 59-72.
6 Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: a universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI ’11). USENIX Association, Berkeley, CA, USA.
81. References (3)
7 Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, and Rafael Pasquin. 2011. Incoop: MapReduce for incremental computations. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC ’11). ACM, New York, NY, USA.