Relational databases are useful for their transactional and query capabilities but do not scale linearly. Distributed "shared nothing" architectures like Hadoop MapReduce are more scalable by making work independent and parallelizable. However, real-world problems often require sharing data between processes. A common pattern is to use MapReduce for bulk processing and loading data, then use databases or other technologies for interactive queries and iterative jobs. Choosing the right data storage technology depends on an application's specific needs.
This document compares approaches to large-scale data analysis using MapReduce and parallel database management systems (DBMSs). It presents results from running a benchmark of tasks on an open-source MapReduce system (Hadoop) and two parallel DBMSs on a cluster of 100 nodes. The parallel DBMSs showed significantly better performance than MapReduce on the tasks, but required much more time to load data and tune execution. The document discusses architectural differences between the approaches and their performance implications.
Parallel Data Processing with MapReduce: A Survey (Kyong-Ha Lee)
This document summarizes a survey on parallel data processing with MapReduce. It provides an overview of the MapReduce framework, including its architecture, key concepts of Map and Reduce functions, and how it handles parallel processing. It also discusses some inherent pros and cons of MapReduce, such as its simplicity but also performance limitations. Finally, it outlines approaches studied in recent literature to improve and optimize the MapReduce framework.
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners (ijcsit)
This document summarizes a research paper that proposes using in-node combiners to improve the performance of Hadoop MapReduce jobs. It discusses how MapReduce jobs are I/O intensive and describes two common bottlenecks: during the map phase when data is loaded from disks, and during the shuffle phase when intermediate results are transferred over the network. The paper introduces an in-node combiner approach to optimize I/O by locally aggregating intermediate results within nodes to reduce network traffic between mappers and reducers. It evaluates this approach through an experiment counting word occurrences in Twitter messages.
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al... (IRJET Journal)
This document presents two variations of a job-driven scheduling scheme called JOSS for efficiently executing MapReduce jobs on remote outsourced data across multiple data centers. The goal of JOSS is to improve data locality for map and reduce tasks, avoid job starvation, and improve job performance. Extensive experiments show that the two JOSS variations, called JOSS-T and JOSS-J, outperform other scheduling algorithms in terms of data locality and network overhead, without incurring significant extra overhead. JOSS-T performs best for workloads of small jobs, while JOSS-J provides the shortest workload time for jobs of varying sizes distributed across data centers.
The document provides an overview of Hadoop, describing it as an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It discusses key Hadoop components like HDFS for storage, MapReduce for distributed processing, and YARN for resource management. The document also gives examples of how organizations are using Hadoop at large scale for applications like search indexing and data analytics.
This document describes the Pig system, which is a high-level data flow system built on top of MapReduce. Pig provides a language called Pig Latin for analyzing large datasets. Pig Latin programs are compiled into MapReduce jobs. The compilation process involves several steps: (1) parsing and type checking the Pig Latin code, (2) logical optimization, (3) converting the logical plan into physical operators like GROUP and JOIN, (4) mapping the physical operators to MapReduce stages, and (5) optimizing the MapReduce plan. This allows users to write data analysis programs more declaratively without coding MapReduce jobs directly.
This presentation describes how to use Perl to run a sample data processing application using the MapReduce framework and gearmand servers.
The original demo code was hosted at the Ann Arbor Perl Mongers web site but moved to github:jpitts/gearman-mapreduce-demo.
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ... (Big Data Spain)
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/cloudMC-a-cloud-computing-map-reduce-implementation-for-radiotherapy/ruben-jimenez-and-hector-miras
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015 (Deanna Kosaraju)
Optimal Execution Of MapReduce Jobs In Cloud
Anshul Aggarwal, Software Engineer, Cisco Systems
Session Length: 1 Hour
Tue March 10 21:30 PST
Wed March 11 0:30 EST
Wed March 11 4:30:00 UTC
Wed March 11 10:00 IST
Wed March 11 15:30 Sydney
Voices 2015 www.globaltechwomen.com
We use the MapReduce programming paradigm because it lends itself well to most data-intensive analytics jobs run in the cloud these days, given its ability to scale out and leverage several machines to process data in parallel. Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately applicable to MapReduce-based applications. Provisioning a MapReduce job entails requesting the optimum number of resource sets (RS) and configuring MapReduce parameters such that each resource set is maximally utilized.
Each application has a different bottleneck resource (CPU, disk, or network) and a different bottleneck resource utilization, and thus needs a different combination of these parameters, chosen from the job profile, such that the bottleneck resource is maximally utilized.
The problem at hand is thus defining a resource provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as optimal resource utilization with minimum incurred cost, lower execution time, energy awareness, automatic handling of node failures, and a highly scalable solution.
The document discusses linking the statistical programming language R with the Hadoop platform for big data analysis. It introduces Hadoop and its components like HDFS and MapReduce. It describes three ways to link R and Hadoop: RHIPE which performs distributed and parallel analysis, RHadoop which provides HDFS and MapReduce interfaces, and Hadoop streaming which allows R scripts to be used as Mappers and Reducers. The goal is to use these methods to analyze large datasets with R functions on Hadoop clusters.
Hanborq Optimizations on Hadoop MapReduce (Hanborq Inc.)
A Hanborq-optimized Hadoop distribution, with a particular focus on MapReduce performance. It's the core part of HDH (Hanborq Distribution with Hadoop for Big Data Engineering).
Python in an Evolving Enterprise System (PyData SV 2013) (PyData)
The document evaluates different solutions for integrating Python with Hadoop to enable data modeling on Hadoop clusters. It tests various frameworks like Native Java, Streaming, mrjob, PyCascading, and Pig using a sample budget aggregation problem. Pig and PyCascading allow complex pipelines to be expressed simply, while Pig is more performant and mature, making it the most viable option for ad-hoc analysis on Hadoop from Python.
HDFS is Hadoop's distributed file system. It has a master-slave architecture with a NameNode master and DataNodes slaves. The NameNode manages file system metadata and DataNodes store data blocks. HDFS is designed for large files and streams data. It replicates blocks across DataNodes for fault tolerance.
Apache Hadoop is an open source software framework for big data. It has two main components: HDFS for distributed storage, and MapReduce as a programming model. The Hadoop ecosystem includes many related projects that run on HDFS and YARN, including Spark, Storm, Hive and Pig. Zookeeper provides centralized coordination services for these projects. Technologies like Cassandra, HBase and MongoDB provide NoSQL database functionality for storing large, unstructured datasets.
The document discusses two papers about MapReduce. The first paper describes Google's implementation of MapReduce (Hadoop) which uses a master-slave model. The second paper proposes a peer-to-peer MapReduce architecture to handle dynamic node failures including master failures. It compares the two approaches, noting that the P2P model provides better fault tolerance against master failures.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of computers. It addresses problems like hardware failure and combining data after analysis. The core components are HDFS for distributed storage and MapReduce for distributed processing. HDFS stores data as blocks across nodes and handles replication for reliability. The Namenode manages the file system namespace and metadata, while Datanodes store and retrieve blocks. Hadoop supports reliable analysis of large datasets in a distributed manner through its scalable architecture.
1. The document discusses concepts related to managing big data using Hadoop including data formats, analyzing data with MapReduce, scaling out, data flow, Hadoop streaming, and Hadoop pipes.
2. Hadoop allows for distributed processing of large datasets across clusters of computers using a simple programming model. It scales out to large clusters of commodity hardware and manages data processing and storage automatically.
3. Hadoop streaming and Hadoop pipes provide interfaces for running MapReduce jobs using any programming language, such as Python or C++, instead of just Java. This allows developers to use the language of their choice.
dmapply: A functional primitive to express distributed machine learning algor... (Bikash Chandra Karmokar)
ddR is a package that introduces distributed data structures in R, such as darray, dframe, and dlist. It provides a standardized API for distributed iteration and data manipulation through functions like dmapply. ddR aims to make distributed computing in R easier to use with good performance: algorithms are written once and can run on different distributed backends, such as Spark and HPE Distributed R, through its unified interface. Evaluation shows ddR algorithms have performance comparable to or better than custom implementations and other machine learning libraries.
MapReduce: A useful parallel tool that still has room for improvement (Kyong-Ha Lee)
The document discusses MapReduce, a framework for processing large datasets in parallel. It provides an overview of MapReduce's basic principles, surveys research to improve the conventional MapReduce framework, and describes research projects ongoing at KAIST. The key points are that MapReduce provides automatic parallelization, fault tolerance, and distributed processing of large datasets across commodity computer clusters. It also introduces the map and reduce functions that define MapReduce jobs.
The document discusses MapReduce, including its programming model, internal framework, and improvements. It describes MapReduce as a programming model and framework that allows parallel processing of large datasets across commodity machines. The map function processes input key-value pairs to generate intermediate pairs, and the reduce function combines values for each key. The framework automatically parallelizes jobs and provides fault tolerance.
Working with thousands, millions, or billions of data records in high dimensions is increasingly becoming the reality for scientific research. What are some techniques to make this kind of data volume tractable? How can parallel computing help? In this talk I'll review data management tools and infrastructures, languages, and paradigms that help in this regard. In particular, I'll discuss Hadoop, MapReduce, Python, NumPy, and Globus Online to provide a survey of ways in which researchers can manage their data and process it in parallel.
This document discusses distributed data processing using MapReduce and Hadoop in a cloud computing environment. It describes the need for scalable, economical, and reliable distributed systems to process petabytes of data across thousands of nodes. It introduces Hadoop, an open-source software framework that allows distributed processing of large datasets across clusters of computers using MapReduce. Key aspects of Hadoop discussed include its core components HDFS for distributed file storage and MapReduce for distributed computation.
Streaming Distributed Data Processing with Silk #deim2014 (Taro L. Saito)
Silk is a framework for building and running complex workflows of distributed data processing. It allows describing dataflows in Scala in a type-safe and concise syntax. Silk translates Scala programs into logical plans and schedules the distributed execution through various "weavers" like an in-memory weaver or Hadoop weaver. It performs static and run-time optimizations of dataflows and supports features like fault tolerance, resource monitoring, and UNIX command integration. The goal of Silk is to enable distributed data analysis for all data scientists through an object-oriented programming model.
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011) (Matthew Lease)
Here are a few reasons why using the reducer as the combiner doesn't work for computing the mean:
1. The reducer expects an iterator of values for a given key, but for the mean we need to track both a sum and a count across all values for a key.
2. The reducer is called once per unique key, but to compute the mean we would need to track partial sums and counts across multiple invocations for the same key, and there is no way to preserve state between calls to the reducer.
3. The reducer's output type needs to match the mapper's output type, but the mapper emits (key, value) pairs while the reducer would need to emit (key, (sum, count)) pairs, so the types would not match.
Brief introduction on Hadoop, Dremel, Pig, FlumeJava and Cassandra (Somnath Mazumdar)
This document provides an overview of several big data technologies including MapReduce, Pig, Flume, Cascading, and Dremel. It describes what each technology is used for, how it works, and example applications. MapReduce is a programming model for processing large datasets in a distributed environment, while Pig, Flume, and Cascading build upon MapReduce to provide higher-level abstractions. Dremel is an interactive query system for nested and complex datasets that uses a column-oriented data storage format.
This document provides an introduction to big data and Hadoop. It discusses what big data is, why companies like it (e.g. Walmart uses it to increase online sales), and how Hadoop works. Specifically, it describes the Hadoop Distributed File System (HDFS) architecture, MapReduce algorithm, and common data types in MapReduce like Writables. It also gives tips for optimizing MapReduce code, such as using combiners and partitioners, before concluding with contact details.
This document provides an overview of a Big Data training presentation. It discusses topics that will be covered including uses of Big Data, Hadoop, HDFS architecture, MapReduce, and tips for optimizing MapReduce codes. The presentation introduces key concepts such as what is Big Data, why use Big Data, what is Hadoop, the HDFS and MapReduce architectures, and demonstrates a word count example MapReduce algorithm. Contact details are provided at the end for any questions.
Hadoop and Mapreduce for .NET User Group (Csaba Toth)
This document provides an introduction to Hadoop and MapReduce. It discusses big data characteristics and challenges. It provides a brief history of Hadoop and compares it to RDBMS. Key aspects of Hadoop covered include the Hadoop Distributed File System (HDFS) for scalable storage and MapReduce for scalable processing. MapReduce uses a map function to process key-value pairs and generate intermediate pairs, and a reduce function to merge values by key and produce final results. The document demonstrates MapReduce through an example word count program and includes demos of implementing it on Hortonworks and Azure HDInsight.
The document provides an introduction to MapReduce, including:
- MapReduce is a framework for executing parallel algorithms across large datasets using commodity computers. It is based on map and reduce functions.
- Mappers process input key-value pairs in parallel, and their outputs are sorted and grouped by key before reaching the reducers.
- Examples demonstrate how MapReduce can be used for tasks like building indexes, joins, and iterative algorithms.
When two of the most powerful innovations in modern analytics come together, the result is revolutionary.
This presentation covers:
- An overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization.
- The ways that R and Hadoop have been integrated.
- A use case that provides real-world experience.
- A look at how enterprises can take advantage of both of these industry-leading technologies.
Presented at Hadoop World 2011 by:
David Champagne
CTO, Revolution Analytics
David Champagne is a top software architect, programmer and product manager with over 20 years of experience in enterprise and web application development for business customers across a wide range of industries. As Principal Architect/Engineer for SPSS, Champagne led the development teams, and created and led the text mining team.
The document discusses using MapReduce and NoSQL databases like MongoDB and Accumulo to solve challenges of analyzing large datasets by allowing distributed processing and incremental updates compared to traditional analytical systems. It provides examples of using MapReduce on MongoDB and Accumulo to perform analytics and maintain running aggregates or results. The document also discusses tradeoffs between different approaches and best practices for optimizing performance when using MapReduce and NoSQL databases together.
The document describes the Hadoop ecosystem and its core components. It discusses HDFS, which stores large files across clusters and is made up of a NameNode and DataNodes. It also discusses MapReduce, which allows distributed processing of large datasets using a map and reduce function. Other components discussed include Hive, Pig, Impala, and Sqoop.
The WordCount and Sort examples demonstrate basic MapReduce algorithms in Hadoop. WordCount counts the frequency of words in a text document by having mappers emit (word, 1) pairs and reducers sum the counts. Sort uses an identity mapper and reducer to simply sort the input files by key. Both examples read from and write to HDFS, and can be run on large datasets to benchmark a Hadoop cluster's sorting performance.
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R... (Cloudera, Inc.)
When two of the most powerful innovations in modern analytics come together, the result is revolutionary. This session will provide an overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization. It will discuss the ways that R and Hadoop have been integrated and look at a use case that provides real-world experience. Finally, it will provide suggestions on how enterprises can take advantage of both of these industry-leading technologies.
The document describes Threp, a lightweight remapping framework for use in Earth system models. Threp aims to provide a flexible, readable, and efficient framework for remapping data between different grid types, including regular, rectilinear, curvilinear, and unstructured grids. It supports operations like interpolation, masking, and extrapolation between source and destination grids. Threp uses a two-stage process of first generating interpolation weights and then applying those weights to remap data values. It is designed for parallel computation and to be easily extensible to support new interpolation methods and grid types.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable storage through HDFS and distributed processing via MapReduce. HDFS handles storage and MapReduce provides a programming model for parallel processing of large datasets across a cluster. The MapReduce framework consists of a mapper that processes input key-value pairs in parallel, and a reducer that aggregates the output of the mappers by key.
This document provides an overview of the MapReduce paradigm and Hadoop framework. It describes how MapReduce uses a map and reduce phase to process large amounts of distributed data in parallel. Hadoop is an open-source implementation of MapReduce that stores data in HDFS. It allows applications to work with thousands of computers and petabytes of data. Key advantages of MapReduce include fault tolerance, scalability, and flexibility. While it is well-suited for batch processing, it may not replace traditional databases for data warehousing. Overall efficiency remains an area for improvement.
Hadoop MapReduce is an open source framework for distributed processing of large datasets across clusters of computers. It allows parallel processing of large datasets by dividing the work across nodes. The framework handles scheduling, fault tolerance, and distribution of work. MapReduce consists of two main phases - the map phase where the data is processed key-value pairs and the reduce phase where the outputs of the map phase are aggregated together. It provides an easy programming model for developers to write distributed applications for large scale processing of structured and unstructured data.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
This document provides an overview of MapReduce and Hadoop frameworks. It describes how MapReduce works by dividing data processing into two phases - map and reduce. The map phase processes input data in parallel and produces intermediate key-value pairs, while the reduce phase aggregates the intermediate outputs by key. Hadoop provides an implementation of MapReduce by running tasks on a distributed file system and coordinating execution across clusters.
In this session you will learn:
1. Meet MapReduce
2. Word Count Algorithm – Traditional approach
3. Traditional approach on a Distributed System
4. Traditional approach – Drawbacks
5. MapReduce Approach
6. Input & Output Forms of a MR program
7. Map, Shuffle & Sort, Reduce Phase
8. WordCount Code walkthrough
9. Workflow & Transformation of Data
10. Input Split & HDFS Block
11. Relation between Split & Block
12. Data locality Optimization
13. Speculative Execution
14. MR Flow with Single Reduce Task
15. MR flow with multiple Reducers
16. Input Format & Hierarchy
17. Output Format & Hierarchy
Adaptive MapReduce using Situation-Aware Mappersrvernica
We propose new adaptive runtime techniques for MapReduce that improve performance and simplify job tuning. We implement these techniques by breaking a key assumption of MapReduce that mappers run in isolation. Instead, our mappers communicate through a distributed meta-data store and are aware of the global state of the job. However, we still preserve the fault-tolerance, scalability, and programming API of MapReduce. We utilize these situation-aware mappers to develop a set of techniques that make MapReduce more dynamic: (a) Adaptive Mappers dynamically take multiple data partitions (splits) to amortize mapper start-up costs; (b) Adaptive Combiners improve local aggregation by maintaining a cache of partial aggregates for the frequent keys; (c) Adaptive Sampling and Partitioning sample the mapper outputs and use the obtained statistics to produce balanced partitions for the reducers. Our experimental evaluation shows that adaptive techniques provide up to 3x performance improvement, in some cases, and dramatically improve performance stability across the board.
3. Relational Databases are Awesome
Atomic, transactional updates
Guaranteed consistency
Declarative queries
Easy to reason about
Long track record of success
18.–20. Quiz: which one is scalable?
A 1000-node Hadoop cluster where jobs depend on a common process, or
1000 Windows ME machines running independent Excel macros?
(The point, developed on the next slides: independence, not node count, is what makes a system scalable.)
25. “Shared Nothing” architectures are the most scalable…
…but most real-world problems require us to share something…
…so our designs usually have a parallel part and a serial part
26. The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.
27. Amdahl’s Law
S(N) = 1 / ((1 - P) + P/N)
  S: speed improvement
  P: ratio of the problem that can be parallelized
  N: number of processors
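To make the formula concrete, here is a minimal sketch (in Java, matching the Hadoop-oriented code in this deck; the numbers are illustrative, not from the slides):

```java
// Amdahl's Law: S(N) = 1 / ((1 - P) + P / N)
public final class Amdahl {

    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with 95% of the work parallelizable, 1000 processors buy
        // only a ~19.6x speedup; the 5% serial part sets the ceiling.
        System.out.printf("P=0.95, N=1000 -> %.1fx%n", speedup(0.95, 1000));
        // A fully parallel job (P = 1.0) scales linearly with N.
        System.out.printf("P=1.00, N=1000 -> %.1fx%n", speedup(1.00, 1000));
    }
}
```

Even a small serial fraction caps the speedup, which is why the previous slide insists the vast majority of the work be parallelizable.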
28. MapReduce Primer
[Diagram: the input data is divided into splits 1…N; each split feeds a mapper in the map phase; the shuffle routes intermediate output to reducers 1…N in the reduce phase.]
29. MapReduce Example: Word Count
[Diagram: each mapper counts the words in one book; the shuffle partitions words alphabetically (A–C, D–E, …, W–Z); each reducer sums the counts for its letter range.]
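To ground the diagram, here is a minimal sketch of that word-count job against the standard org.apache.hadoop.mapreduce API; it follows the stock example, with the input and output paths taken from the command line:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts the shuffle grouped under each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```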
30.–31. Notice there is still a serial part of the problem: the output of the reducers must be combined…
…but this is much smaller, and can be handled by a single process
32.–33. Also notice that the network is a shared resource when processing big data
So rather than moving data to computation, we move computation to data.
34. MapReduce Data Locality
[Diagram: the pipeline from slide 28, with each input split and its mapper drawn inside one box, where each box is a physical machine: map tasks run where their data lives.]
36.–37. Data locality is only guaranteed in the Map phase
So the most data-intensive work should be done in the map, with smaller data sets sent to the reducer
Some Map/Reduce jobs have no reducer at all!
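Turning off the reduce phase is a one-line job setting; a minimal sketch (WordCount and TokenizerMapper refer to the word-count sketch above):

```java
// Map-only job: with zero reduce tasks there is no shuffle at all, and each
// mapper writes its output directly to the filesystem.
Job job = Job.getInstance(new Configuration(), "map-only example");
job.setJarByClass(WordCount.class);
job.setMapperClass(WordCount.TokenizerMapper.class);
job.setNumReduceTasks(0); // turns off the reduce phase entirely
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
```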
38. MapReduce Gone Wrong
[Diagram: the word-count pipeline from slide 29, except the reducers now call out to a shared remote "Word Addition Service" to do their sums.]
39. Even if our Word Addition Service is scalable, we’d need to scale it to the size of the largest Map/Reduce job that will ever use it
40.–41. So for data processing, prefer embedded libraries over remote services
Use remote services for configuration, to prime caches, etc. – just not for every data element!
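One way to read that advice in code: if per-record logic needs remote reference data, fetch it once per map task in setup() and serve every record from memory. A minimal sketch, where ConfigClient and the config.url property are hypothetical stand-ins for whatever service you use:

```java
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EnrichMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Map<String, String> referenceData;

    @Override
    protected void setup(Context context) {
        // One remote call per map task, not one per record: prime a local cache.
        // ConfigClient is a hypothetical stand-in for your own service client.
        referenceData = ConfigClient.fetchSnapshot(
                context.getConfiguration().get("config.url"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Every record is enriched from the in-memory cache; no network per record.
        String enriched = referenceData.getOrDefault(value.toString(), "UNKNOWN");
        context.write(value, new Text(enriched));
    }
}
```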
42. Joining a billion records
Word counts are great, but many real-world problems mean bringing together multiple datasets.
So how do we “join” with MapReduce?
43. Map-Side Joins
When joining one big input to a small one, simply copy the small data set to each mapper
[Diagram: each split of data set 1 gets its own mapper, and a full copy of data set 2 travels with every mapper, so records arrive at the reducers already joined.]
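A minimal sketch of that pattern with Hadoop's distributed cache; the tab-separated record formats and the join on the first field are assumptions for illustration:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Driver side, when configuring the Job (path is illustrative):
//   job.addCacheFile(new URI("/data/small-dataset.txt"));
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Cached files are localized on each mapper's machine; load them into memory.
        for (URI uri : context.getCacheFiles()) {
            String fileName = new File(uri.getPath()).getName(); // localized under its own name
            try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2); // assumed key<TAB>value format
                    smallTable.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t", 2); // assumed big-input format
        String match = smallTable.get(fields[0]);
        if (match != null) { // inner join on the first field
            context.write(new Text(fields[0]), new Text(fields[1] + "\t" + match));
        }
    }
}
```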
44. Merge in Reducer
Route common items to the same reducer
[Diagram: splits of data set 1 and data set 2 are each mapped with "group by key"; the shuffle delivers all records sharing a key to the same reducer, which merges them.]
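A minimal sketch of the usual implementation, which tags each record with its source so the reducer can merge the two sides; the record formats and tags are assumptions, and MultipleInputs wires each input path to its mapper:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

    // Driver wiring (illustrative):
    //   MultipleInputs.addInputPath(job, path1, TextInputFormat.class, DatasetOneMapper.class);
    //   MultipleInputs.addInputPath(job, path2, TextInputFormat.class, DatasetTwoMapper.class);

    // Each mapper tags its records so the reducer knows which data set they came from.
    public static class DatasetOneMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2); // assumed key<TAB>payload
            context.write(new Text(fields[0]), new Text("1\t" + fields[1]));
        }
    }

    public static class DatasetTwoMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            context.write(new Text(fields[0]), new Text("2\t" + fields[1]));
        }
    }

    // The shuffle routes all records sharing a key to one reducer, which merges the sides.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (Text v : values) {
                String[] tagged = v.toString().split("\t", 2);
                ("1".equals(tagged[0]) ? left : right).add(tagged[1]);
            }
            for (String l : left) { // emit the joined pairs for this key
                for (String r : right) {
                    context.write(key, new Text(l + "\t" + r));
                }
            }
        }
    }
}
```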
45. Higher-Level Constructs
MapReduce is a primitive operation for higher-level constructs
Hive, Pig, Cascading, and Crunch all compile into MapReduce
Use one!
Crunch!
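Since the deck's pick is Crunch, here is word count again as a Crunch pipeline: a sketch following Crunch's published example (verify the exact API against your Crunch version):

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
    public static void main(String[] args) {
        // A Pipeline plans and runs the underlying MapReduce jobs for us.
        Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
        PCollection<String> lines = pipeline.readTextFile(args[0]);

        // Split lines into words; Crunch compiles this DoFn into a map phase.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        emitter.emit(word);
                    }
                }
            }
        }, Writables.strings());

        // count() expands into the group-by-key and sum of classic word count.
        PTable<String, Long> counts = words.count();
        pipeline.writeTextFile(counts, args[1]);
        pipeline.done();
    }
}
```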
47.–52. MapReduce vs. MPP Databases (built up one row per slide across slides 47–52):
MapReduce: data in a distributed filesystem | MPP: data in sharded relational databases
MapReduce: oriented towards unstructured or semi-structured data | MPP: oriented towards structured data
MapReduce: Java or domain-specific languages (e.g., Pig and Hive) | MPP: SQL
MapReduce: poor support for iterative operations | MPP: good support for iterative operations
MapReduce: arbitrarily complex programs running next to data | MPP: SQL and user-defined functions running next to data
MapReduce: poor interactive query support | MPP: good interactive query support
54. MapReduce and MPP Databases …are complementary!
Use Map/Reduce to clean, normalize, reconcile and codify data, then load it into an MPP system for interactive analysis
57.–60. Hadoop Distributed Filesystem
Scales to many petabytes
Splits all files into blocks and spreads them across data nodes
The name node keeps track of which blocks belong to which file
All blocks are written in triplicate
Write and append only – no random updates!
61. HDFS Writes
[Diagram: the client asks the name node to look up a data node, writes each block to data node 1, and the block is replicated along a chain to data node 2 and on to data node N.]
62. HDFS Reads
[Diagram: the client asks the name node for the block locations, then reads the blocks directly from the data nodes that hold them.]
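The same write-then-read flow through the client API; a minimal sketch using the standard org.apache.hadoop.fs.FileSystem interface (the path is illustrative):

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.txt"); // illustrative path

        // Write: the client asks the name node for block placement, then
        // streams the bytes to the data nodes (replicated 3x by default).
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client gets block locations from the name node, then
        // pulls the data straight from the data nodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```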
63.–64. HDFS Shortcomings
No random reads
No random writes
Doesn’t deal with many small files
Enter HBase: “Random Access To Your Planet-Size Data”
65.–67. HBase
Emulates random I/O with a Write Ahead Log (WAL)
Periodically flushes the log to sorted files
Files are accessible as tables, split across many regions, hosted by region servers
Preserves the scalability, data locality, and Map/Reduce features of Hadoop
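A minimal sketch of that random-access surface, using the HBase client API of this deck's era (table, family, and qualifier names are illustrative; newer HBase versions replace HTable with Connection/Table):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "patients"); // illustrative table name

        // "Random write": goes to the WAL and memstore, flushed to sorted files later.
        Put put = new Put(Bytes.toBytes("patient-42"));
        put.add(Bytes.toBytes("vitals"), Bytes.toBytes("heart_rate"), Bytes.toBytes("72"));
        table.put(put);

        // "Random read": served from the memstore and sorted files by the region server.
        Result result = table.get(new Get(Bytes.toBytes("patient-42")));
        byte[] value = result.getValue(Bytes.toBytes("vitals"), Bytes.toBytes("heart_rate"));
        System.out.println(Bytes.toString(value));

        table.close();
    }
}
```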
69.–71. Use HBase when:
You have noisy, semi-structured data
You want to apply massively parallel processing to your problem
You need to handle huge write loads
You need a scalable key/value store
72. But there are drawbacks:
Limited schema support
Limited atomicity guarantees
No built-in secondary indexes
HBase is a great tool for many jobs, but not every job
73. The data store should align with the needs of the application
74. So a pattern is emerging:
[Diagram: a Collection → Aggregation → Processing → Storage pipeline. Sources (Millennium, CCDs, Claims, HL7) are aggregated into Hadoop with HBase, processed by MapReduce jobs, and the results land in MPP, relational, and document stores and in HBase.]
75. But we have a potential bottleneck
[Diagram: the same pipeline as slide 74, with the load from processing into the storage systems highlighted as the bottleneck that slide 76 addresses.]
76. Direct inserts are designed for online updates, not massively parallel data loads
So shift the work into MapReduce, and pre-build files for bulk import:
Oracle Loader for Hadoop
HBase HFile import
Bulk loads for MPP
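For the HBase case, a hedged sketch of the driver wiring: HFileOutputFormat.configureIncrementalLoad sorts and partitions reducer output into per-region HFiles, which the completebulkload tool then hands to the region servers. BulkLoadMapper and the table name are illustrative; verify the calls against your HBase version:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase bulk load");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(BulkLoadMapper.class); // hypothetical: emits (ImmutableBytesWritable, KeyValue)
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);

        // Sorts and partitions the output so each reducer writes one region's HFiles.
        HTable table = new HTable(conf, "patients"); // illustrative table name
        HFileOutputFormat.configureIncrementalLoad(job, table);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Afterwards, hand the finished HFiles to the region servers, e.g.:
        //   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <output-dir> patients
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```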
77. And we’re missing an important piece:
[Diagram: the same pipeline as slide 74, repeated unchanged to set up the next slide.]
78. And we’re missing an important piece:
[Diagram: the same pipeline, now adding a Realtime Processing path alongside the batch Map/Reduce jobs between aggregation and storage.]
79. How do we make it fast?
Speed Layer
Batch Layer
http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
80. How do we make it fast?
Speed Layer: move data to computation; hours of data; incremental updates; low latency (seconds to process)
Batch Layer: move computation to data; years of data; bulk loads; high latency (minutes or hours to process)
81. How do we make it fast?
Speed Layer: Complex Event Processing (Storm)
Batch Layer: Hadoop MapReduce
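A minimal sketch of what a speed-layer topology looks like in Storm's Java API (EventSpout and RollupBolt are hypothetical placeholders; older Storm versions use the backtype.storm packages instead of org.apache.storm):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class SpeedLayerTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // EventSpout and RollupBolt are hypothetical: the spout streams events in,
        // the bolt keeps incremental aggregates over the most recent hours of data.
        builder.setSpout("events", new EventSpout(), 4);
        builder.setBolt("rollup", new RollupBolt(), 8)
               .fieldsGrouping("events", new Fields("key"));

        StormSubmitter.submitTopology("speed-layer", new Config(), builder.createTopology());
    }
}
```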
84. Quickly create new data models
Fast iteration cycles mean fast innovation
Process all data overnight
Simple correction of any bugs
Much easier to understand and work with