The document summarizes a presentation on using R and Hadoop together. It includes:
1) An outline of topics to be covered, including why to use MapReduce and R, options for combining R and Hadoop, an overview of RHadoop, a step-by-step example, and advanced RHadoop features.
2) Code examples from Jonathan Seidman showing how to analyze airline on-time data using different R and Hadoop options - naked streaming, Hive, RHIPE, and RHadoop.
3) The analysis calculates average departure delays by year, month and airline using each method.
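To give a feel for the RHadoop option, here is a minimal sketch (not Seidman's actual code) of how that departure-delay calculation can look with the rmr package; the HDFS path and the column positions assumed for year, month, carrier and departure delay are illustrative, not taken from the talk.

    library(rmr2)   # the current incarnation of RHadoop's rmr package

    # map: one record per flight; key = year|month|carrier, value = departure delay
    delay.map <- function(k, lines) {
      fields <- strsplit(lines, ",")
      keys   <- sapply(fields, function(f) paste(f[1], f[2], f[9], sep = "|"))
      delays <- as.numeric(sapply(fields, "[", 16))
      keyval(keys, delays)
    }

    # reduce: average the delays collected for each key
    delay.reduce <- function(key, delays) keyval(key, mean(delays, na.rm = TRUE))

    out <- mapreduce(input        = "/data/airline",   # assumed HDFS location
                     input.format = "text",
                     map          = delay.map,
                     reduce       = delay.reduce)
    results <- from.dfs(out)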
R and Hadoop are changing the way organizations manage and utilize big data. Think Big Analytics and Revolution Analytics are helping clients plan, build, test and implement innovative solutions based on the two technologies that allow clients to analyze data in new ways, exposing new insights for the business. Join us as Jeffrey Breen explains the core technology concepts and illustrates how to utilize R and Revolution Analytics' RevoR in Hadoop environments.
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package) - Jeffrey Breen
The document describes a Big Data workshop held on March 10, 2012 at the Microsoft New England Research & Development Center in Cambridge, MA. The workshop focused on using R and Hadoop, with an emphasis on RHadoop's rmr package. The document provides an introduction to using R with Hadoop and discusses several R packages for working with Hadoop, including RHIPE, rmr, rhdfs, and rhbase. Code examples are presented demonstrating how to calculate average departure delays by airline and month from an airline on-time performance dataset using different approaches, including Hadoop streaming, hive, RHIPE and rmr.
Hadoop and Pig are tools for analyzing large datasets. Hadoop uses MapReduce and HDFS for distributed processing and storage. Pig provides a high-level language for expressing data analysis jobs that are compiled into MapReduce programs. Common tasks like joins, filters, and grouping are built into Pig for easier programming compared to lower-level MapReduce.
Pig is a platform for analyzing large datasets that sits on top of Hadoop. It provides a simple language called Pig Latin for expressing data analysis processes. Pig Latin scripts are compiled into series of MapReduce jobs that process and analyze data in parallel across a Hadoop cluster. Pig aims to be easier to use than raw MapReduce programs by providing high-level operations like JOIN, FILTER, GROUP, and allowing analysis to be expressed without writing Java code. Common use cases for Pig include log and web data analysis, ETL processes, and quick prototyping of algorithms for large-scale data.
This document provides instructions for installing and configuring Hadoop 2.2 on a single node cluster. It describes the new features in Hadoop 2.2 including updated MapReduce framework with Apache YARN, enabling multiple tools to access HDFS concurrently. It then outlines the step-by-step process for downloading Hadoop, configuring environment variables, creating data directories, starting HDFS and YARN processes, and running a sample word count job. Web interfaces for monitoring HDFS and applications are also described.
Integrating R & Hadoop - Text Mining & Sentiment Analysis - Aravind Babu
The document discusses integrating R and Hadoop for big data analytics. It notes that existing statistical applications like R are incapable of handling big data, while data management tools lack analytical capabilities. Integrating R with Hadoop bridges this gap by leveraging R's analytics and statistics functionality with Hadoop's ability to process and store distributed data. RHadoop is introduced as an open source project that allows R programmers to directly use MapReduce functionality in R code. Specific RHadoop packages like rhdfs and rmr2 are described that enable interacting with HDFS and performing statistical analysis via MapReduce on Hadoop clusters. Text analytics use cases with R and Hadoop like sentiment analysis are also briefly outlined.
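As a flavor of what those packages look like in practice, here is a brief, hedged sketch of the rhdfs side; the paths and the HADOOP_CMD location are examples, not taken from the document.

    Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")   # where the hadoop launcher lives
    library(rhdfs)
    hdfs.init()

    hdfs.ls("/user/analyst")                     # list an HDFS directory
    hdfs.put("tweets.csv", "/user/analyst/")     # copy a local file into HDFS
    hdfs.get("/user/analyst/tweets.csv", "tweets_copy.csv")  # and back out again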
This document discusses Hadoop interview questions and provides resources for preparing for Hadoop interviews. It notes that as demand for Hadoop professionals has increased, Hadoop interviews have become more complex with scenario-based and analytical questions. The document advertises a Hadoop interview guide with over 100 real Hadoop developer interview questions and answers on the website bigdatainterviewquestions.com. It provides examples of common Hadoop questions around debugging jobs, using Capacity Scheduler, benchmarking tools, joins in Pig, analytic functions in Hive, and Hadoop concepts.
The document contains 31 questions and answers related to Hadoop concepts. It covers topics like common input formats in Hadoop, differences between TextInputFormat and KeyValueInputFormat, what are InputSplits and how they are created, how partitioning, shuffling and sorting occurs after the map phase, what is a combiner, functions of JobTracker and TaskTracker, how speculative execution works, using distributed cache and counters, setting number of mappers/reducers, writing custom partitioners, debugging Hadoop jobs, and failure handling processes for production Hadoop jobs.
The document provides an introduction to Hadoop, including an overview of its core components HDFS and MapReduce, and motivates their use by explaining the need to process large amounts of data in parallel across clusters of computers in a fault-tolerant and scalable manner. It also presents sample code walkthroughs and discusses the Hadoop ecosystem of related projects like Pig, HBase, Hive and Zookeeper.
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
Speaking of big data analysis, what usually comes to mind is using HDFS and MapReduce within Hadoop. But to write a MapReduce program, one typically has to learn to write native Java. One might wonder: is it possible to use R, the language most widely adopted by data scientists, to implement MapReduce programs? And through the integration of R and Hadoop, can one truly unleash the power of parallel computing for big data analysis?
These slides introduce how to install RHadoop step by step and how to write MapReduce programs in R. More importantly, they discuss whether RHadoop is truly a guiding light for big data analysis, or just another way to write MapReduce programs.
Please email me if you find any problems with the slides. EMAIL: tr.ywchiu@gmail.com
When big data comes up, what usually comes to mind is Hadoop's MapReduce and HDFS, but writing MapReduce means learning Java or going through a Thrift interface. Can R actually run on Hadoop? And with R + Hadoop, can R's powerful analytics really be brought to bear on big data?
This talk walks step by step through installing the RHadoop packages on Hadoop and shows how to write MapReduce programs in R. More importantly, it asks whether RHadoop is a guiding light for big data analysis, or merely another way to implement MapReduce.
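A typical installation along the lines the slides describe looks roughly like this; the package names come from the RHadoop project, but the tarball names, dependency list and paths below are illustrative and will differ on your cluster.

    # environment variables rmr2/rhdfs expect, pointing at your Hadoop install
    Sys.setenv(HADOOP_CMD       = "/usr/bin/hadoop")
    Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")

    # CRAN dependencies, then the RHadoop packages from locally downloaded tarballs
    install.packages(c("rJava", "RJSONIO", "itertools", "digest",
                       "Rcpp", "functional", "reshape2", "stringr", "plyr"))
    install.packages("rmr2.tar.gz",  repos = NULL, type = "source")
    install.packages("rhdfs.tar.gz", repos = NULL, type = "source")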
Owen O'Malley is an architect at Yahoo who works full-time on Hadoop. He discusses Hadoop's origins, how it addresses the problem of scaling applications to large datasets, and its key components including HDFS and MapReduce. Yahoo uses Hadoop extensively, including for building its Webmap and running experiments on large datasets.
The document provides interview questions and answers related to Hadoop. It discusses common InputFormats in Hadoop like TextInputFormat, KeyValueInputFormat, and SequenceFileInputFormat. It also describes concepts like InputSplit, RecordReader, partitioner, combiner, job tracker, task tracker, jobs and tasks relationship, debugging Hadoop code, and handling lopsided jobs. HDFS, its architecture, replication, and reading files from HDFS is also covered.
Hadoop is an open source framework for distributed storage and processing of vast amounts of data across clusters of computers. It uses a master-slave architecture with a single JobTracker master and multiple TaskTracker slaves. The JobTracker schedules tasks like map and reduce jobs on TaskTrackers, which each run task instances in separate JVMs. It monitors task progress and reschedules failed tasks. Hadoop uses MapReduce programming model where the input is split and mapped in parallel, then outputs are shuffled, sorted, and reduced to form the final results.
There's a big shift at both the architecture and API level from Hadoop 1 to Hadoop 2, particularly YARN, and we held our first meetup to talk about this (http://www.meetup.com/Atlanta-YARN-User-Group/) on 10/13/2013.
This is the basis for some talks I've given at the Microsoft Technology Center, the Chicago Mercantile Exchange, and local user groups over the past two years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
The document discusses common interview questions about Hadoop Distributed File System (HDFS). It provides explanations for several key HDFS concepts including the essential features of HDFS, streaming access, the roles of the namenode and datanode, heartbeats, blocks, and ways to access and recover files in HDFS. It also covers MapReduce concepts like the jobtracker, tasktracker, task instances, and Hadoop daemons.
Big data interview questions and answers - Kalyan Hadoop
This document provides an overview of the Hadoop Distributed File System (HDFS), including its goals, design, daemons, and processes for reading and writing files. HDFS is designed for storing very large files across commodity servers, and provides high throughput and reliability through replication. The key components are the NameNode, which manages metadata, and DataNodes, which store data blocks. The Secondary NameNode assists the NameNode in checkpointing filesystem state periodically.
Hadoop is a distributed processing framework for large datasets. It stores data across clusters of commodity hardware in a Hadoop Distributed File System (HDFS) and provides tools for distributed processing using MapReduce. HDFS uses a master-slave architecture with a namenode managing metadata and datanodes storing data blocks. Data is replicated across nodes for reliability. MapReduce allows distributed processing of large datasets in parallel across clusters.
Big Data Step-by-Step: Infrastructure 1/3: Local VM - Jeffrey Breen
Part 1 of 3 of a series focusing on the infrastructure aspect of getting started with Big Data, specifically Hadoop. This presentation starts small, installing a pre-packaged virtual machine from Hadoop vendor Cloudera on your local machine.
We then install R, copy some sample data into HDFS, and test everything by running one of Jonathan Seidman's sample streaming jobs.
Presented at the Boston Predictive Analytics Big Data Workshop, March 10, 2012
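To give a sense of the "naked" streaming approach tested in this part, a mapper for Hadoop streaming can be nothing more than an R script that reads stdin and writes tab-separated key/value pairs to stdout. The sketch below assumes the airline CSV column layout and is not Seidman's actual script.

    #! /usr/bin/env Rscript
    # emit "carrier|month <TAB> departure delay" for every flight record on stdin
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1)) > 0) {
      fields <- unlist(strsplit(line, ","))
      if (fields[1] == "Year") next                       # skip the header row
      cat(paste(fields[9], fields[2], sep = "|"), fields[16], sep = "\t")
      cat("\n")
    }
    close(con)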
The document provides an overview of Hadoop and HDFS. It discusses key concepts such as what big data is, examples of big data, an overview of Hadoop, the core components of HDFS and MapReduce, characteristics of HDFS including fault tolerance and throughput, the roles of the namenode and datanodes, and how data is stored and replicated in blocks in HDFS. It also answers common interview questions about Hadoop and HDFS.
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition) - Uwe Printz
Talk held at the IT-Stammtisch Darmstadt on 08.11.2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Hadoop Summit San Jose 2014: Costing Your Big Data Operations - Sumeet Singh
As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations.
Whether an on-premise or cloud-based platform is used for storing, processing and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
This document provides an overview of several advanced Hadoop topics, including:
- YARN, the resource manager that allocates resources and manages job scheduling in Hadoop. It uses a global ResourceManager and per-application ApplicationMasters.
- Testing HDFS I/O throughput with TestDFSIO, a tool that measures read and write performance through MapReduce jobs. It reports metrics like throughput and IO rates.
- The mrjob Python library, which provides a framework for writing multi-step MapReduce jobs in Python that can be run locally or on a Hadoop cluster. Sample code demonstrates defining a job class with mapper, reducer, and step methods.
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera, Inc.
Attend this session and walk away armed with solutions to the most common customer problems. Learn proactive configuration tweaks and best practices to keep your cluster free of fetch failures, job tracker hangs, and the like.
The document describes the key limitations of Hadoop 1.x including single point of failure of the NameNode, lack of horizontal scalability, and the JobTracker being overburdened. It then discusses how Hadoop 2.0 addresses these issues through features like HDFS federation for multiple NameNodes, NameNode high availability, and YARN which replaces MapReduce and allows sharing of cluster resources for various workloads.
RHadoop is an effective platform for doing exploratory data analysis over big data sets. The convenience of an interactive command-line interpreter and the overwhelming number of statistical and machine learning routines implemented in R libraries make it a highly effective environment for performing elementary data science.
We'll discuss the basics of RHadoop: what it is, how to install it, and the API fundamentals. Next we'll discuss common use cases that you might want to use RHadoop for. Last, we'll run through an interactive example.
Big Data Step-by-Step: Infrastructure 2/3: Running R and RStudio on EC2 - Jeffrey Breen
Part 2 of 3 of a series focusing on the infrastructure aspect of getting started with Big Data. This presentation is geared towards anyone with an occasional need for more computing power.
We walk through the mechanics of launching an instance on Amazon's EC2, install some software (like R and RStudio), and make sure it all works.
Presented at the Boston Predictive Analytics Big Data Workshop, March 10, 2012.
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily... - Jeffrey Breen
Part 3 of 3 of series focusing on the infrastructure aspect of getting started with Big Data. This presentation demonstrates how to use Apache Whirr to launch a Hadoop cluster on Amazon EC2--easily.
Presented at the Boston Predictive Analytics Big Data Workshop, March 10, 2012. Sample code and configuration files are available on github.
(Presented by Antonio Piccolboni at the Strata 2012 Conference, Feb 29, 2012.)
RHadoop is an open source project spearheaded by Revolution Analytics to grant data scientists access to Hadoop's scalability from their favorite language, R. RHadoop comprises three packages:
- rhdfs provides file-level manipulation for HDFS, the Hadoop file system
- rhbase provides access to HBase, the Hadoop database
- rmr allows writing MapReduce programs in R
rmr allows R developers to program in the MapReduce framework and offers all developers an alternative way to implement MapReduce programs that strikes a delicate compromise between power and usability. It lets you write general MapReduce programs with the full power and ecosystem of an existing, established programming language. It doesn't force you to replace the R interpreter with a special run-time; it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It comprises a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it makes possible, and we will do that, covering a few from machine learning and statistics. Finally, we will discuss how to get involved.
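As a concrete taste of that API, the canonical first example looks something like the sketch below (written against the later rmr2 package name; the original talk predates it):

    library(rmr2)

    small.ints <- to.dfs(1:1000)                     # push a plain R vector into HDFS
    squares <- mapreduce(input = small.ints,
                         map   = function(k, v) keyval(v, v^2))   # map-only job
    str(from.dfs(squares))                           # pull the key/value pairs back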
This document discusses using relational database management systems (RDBMS) in the cloud. It outlines the steps to provision a MySQL database instance on Amazon RDS, including choosing the database and instance type, version, storage, password, input/output options, ports, security groups, and backups. It also mentions using R and RODBC to connect to and analyze data in Cloud MySQL databases, and compares other cloud platforms like Google, Oracle, and Windows for hosting RDBMS and analytics workloads.
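The R/RODBC piece mentioned above boils down to a few lines; the DSN, credentials and query below are placeholders, not values from the document.

    library(RODBC)
    ch <- odbcConnect("rds_mysql", uid = "analyst", pwd = "secret")
    totals <- sqlQuery(ch, "SELECT region, SUM(amount) AS total
                              FROM orders GROUP BY region")
    odbcClose(ch)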
The RHive functions rhive.hdfs.connect, rhive.hdfs.ls, rhive.hdfs.get, rhive.hdfs.put, rhive.hdfs.rm, and rhive.hdfs.rename allow users to interact with HDFS from within R without using the Hadoop command line interface or libraries. These functions provide the same functionality as common Hadoop fs commands like hadoop fs -ls, -get, -put, -rm, and -mv to list, get, put, remove, and rename files in HDFS. The document provides examples of how to use each RHive HDFS function to perform operations like listing files, uploading a local file to HDFS, and deleting files.
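Put together, the functions listed above are used roughly as follows; the namenode URL and paths are examples only.

    library(RHive)
    rhive.hdfs.connect("hdfs://namenode:9000")

    rhive.hdfs.ls("/user/analyst")                           # hadoop fs -ls
    rhive.hdfs.put("local.csv", "/user/analyst/local.csv")   # hadoop fs -put
    rhive.hdfs.get("/user/analyst/local.csv", "copy.csv")    # hadoop fs -get
    rhive.hdfs.rename("/user/analyst/local.csv", "/user/analyst/data.csv")  # hadoop fs -mv
    rhive.hdfs.rm("/user/analyst/data.csv")                  # hadoop fs -rm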
The document discusses running R on Windows Azure. It mentions creating a Windows Azure account with a 90-day free trial and logging into the management portal. It lists the four types of Linux OS available and provides instructions for installing Google Chrome, downloading and installing R, and starting work in R.
R hive tutorial supplement 3 - Rstudio-server setup for rhive - Aiden Seonghak Hong
This document provides instructions for setting up RStudio Server to use the RHive package for analyzing large datasets with R. It describes downloading and installing RStudio Server, creating user accounts, starting the RStudio Server daemon, and connecting to it via a web browser. It also covers potential issues with RHive environment variables and how to resolve them by setting the variables before or after loading the RHive library. The overall goal is to enable convenient use of RHive through RStudio Server's remote desktop interface and ability to keep R sessions running on the server.
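One common fix for the environment-variable issue is simply to set the variables inside the R session before loading RHive; the paths and host name below are examples, not the tutorial's values.

    Sys.setenv(HADOOP_HOME = "/usr/lib/hadoop")
    Sys.setenv(HIVE_HOME   = "/usr/lib/hive")
    library(RHive)
    rhive.init()
    rhive.connect("hive-server-host")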
Move your data (Hans Rosling style) with googleVis + 1 line of R code - Jeffrey Breen
This document describes a lightning talk presented at the Greater Boston useR Group in July 2011 about using the googleVis package in R to create motion charts with only one line of code. It discusses Hans Rosling's use of animated charts, how Google incorporated this into their visualization API, and how the googleVis package allows users to leverage this in R. The talk includes examples of creating motion charts in R with googleVis using sample airline data.
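The "one line of R code" refers to calls of roughly this shape, shown here with the Fruits demo data set that ships with googleVis rather than the airline data used in the talk:

    library(googleVis)
    plot(gvisMotionChart(Fruits, idvar = "Fruit", timevar = "Year"))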
My talk at August's joint meeting of Chicago's R and Hadoop user groups providing an introduction to using R with Hadoop. It starts with a quick introduction to and overview of available options, then focuses on using RHadoop's rmr library to perform an analysis on the publicly-available 'airline' data set.
The document discusses how to use the R programming language and Amazon's Elastic MapReduce service to quickly create a Hadoop cluster on Amazon Web Services in only 15 minutes. It demonstrates running a stochastic simulation to estimate pi by distributing 1,000 simulations across the Hadoop cluster and combining the results. The total cost of running the 15-minute cluster was only $0.15, showing how inexpensive it can be to leverage Hadoop's capabilities.
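The simulation itself is simple enough to sketch. The version below runs the 1,000 simulations locally with lapply(); the talk farms the same idea out as Hadoop tasks on Elastic MapReduce.

    one.sim <- function(i, n = 1e5) {
      x <- runif(n); y <- runif(n)
      4 * mean(x^2 + y^2 <= 1)        # fraction of darts inside the quarter circle
    }
    estimates <- unlist(lapply(1:1000, one.sim))
    mean(estimates)                    # converges toward pi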
Performance and Scale Options for R with Hadoop: A comparison of potential ar... - Revolution Analytics
R and Hadoop go together. In fact, they go together so well, that the number of options available can be confusing to IT and data science teams seeking solutions under varying performance and operational requirements.
Which configuration is faster for big files? Which is faster for sharing data and servers among groups? Which eliminates data movement? Which is easiest to manage? Which works best with iterative and multistep algorithms? What are the hardware requirements of each alternative?
This webinar is intended to help new users of R with Hadoop select their best architecture for integrating Hadoop and R, by explaining the benefits of several popular configurations, their performance potential, workload handling and programming model and administrative characteristics.
Presenters from Revolution Analytics will describe the options for using Revolution R Open and Revolution R Enterprise with Hadoop, including servers, edge nodes, rHadoop and ScaleR. We'll then compare the characteristics of each configuration with regard to performance as well as programming model, administration, data movement, ease of scaling, mixed workload handling, and performance for large individual analyses vs. mixed workloads.
Overview of how/why to reshape data in R from "wide" (spreadsheet-like) to "long" (database-like) and back.
Focuses on Hadley Wickham's reshape2 package and uses state population data from the 2010 U.S. Census. Also demonstrates use of dcast() to replace table(), etc. to generate crosstabs from a sample market research consumer survey.
Presented at the April 2011 meeting of the Greater Boston useR Group.
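In miniature, the wide-to-long-and-back workflow described above looks like this; a toy data frame stands in for the Census data used in the talk.

    library(reshape2)

    wide <- data.frame(state   = c("MA", "NH"),
                       pop2000 = c(6.3, 1.2),     # illustrative populations, in millions
                       pop2010 = c(6.5, 1.3))

    long <- melt(wide, id.vars = "state",
                 variable.name = "year", value.name = "population")  # wide -> long
    dcast(long, state ~ year, value.var = "population")              # long -> wide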
Overview of accessing relational databases from R. Focuses on and demonstrates the DBI family (RMySQL, RPostgreSQL, ROracle, RJDBC, etc.) but also introduces RODBC. Highlights DBI's dbApply() function to combine the strengths of SQL and *apply() on large data sets. Demonstrates the sqldf package, which provides SQL access to standard R data.frames.
Presented at the May 2011 meeting of the Greater Boston useR Group.
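Two quick sketches of the approaches covered, with placeholder connection details and table names:

    # DBI family: connect, query, disconnect
    library(RMySQL)
    con <- dbConnect(MySQL(), host = "dbhost", dbname = "sales",
                     user = "analyst", password = "secret")
    by.region <- dbGetQuery(con, "SELECT region, COUNT(*) AS n FROM orders GROUP BY region")
    dbDisconnect(con)

    # sqldf: SQL over an ordinary data.frame
    library(sqldf)
    sqldf("SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")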
Slides from my lightning talk at the Boston Predictive Analytics Meetup hosted at Predictive Analytics World, Boston, October 1, 2012.
Full code and data are available on github: http://bit.ly/pawdata
Introduction to Spark R with R studio - Mr. Pragith, Sigmoid
R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
RStudio IDE is a powerful and productive user interface for R.
It’s free and open source, and available on Windows, Mac, and Linux.
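A minimal SparkR session, assuming a Spark 2.x installation with SPARK_HOME set (this is a generic sketch, not code from the talk), looks like:

    library(SparkR)
    sparkR.session(master = "local[2]")

    df <- as.DataFrame(faithful)                 # lift a local data.frame into Spark
    head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))

    sparkR.session.stop()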
Overview of a few ways to group and summarize data in R using sample airfare data from DOT/BTS's O&D Survey.
Starts with naive approach with subset() & loops, shows base R's tapply() & aggregate(), highlights doBy and plyr packages.
Presented at the March 2011 meeting of the Greater Boston useR Group.
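The progression described above, shown on a toy fare table rather than the O&D Survey extract from the talk:

    fares <- data.frame(carrier = c("AA", "AA", "UA", "UA", "WN"),
                        fare    = c(210, 250, 180, 220, 120))

    tapply(fares$fare, fares$carrier, mean)               # base R, returns a named array
    aggregate(fare ~ carrier, data = fares, FUN = mean)   # base R, returns a data.frame

    library(plyr)
    ddply(fares, "carrier", summarise, mean_fare = mean(fare))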
This document provides an overview of key concepts in statistics for data science, including:
- Descriptive statistics like measures of central tendency (mean, median, mode) and variation (range, variance, standard deviation).
- Common distributions like the normal, binomial, and Poisson distributions.
- Statistical inference techniques like hypothesis testing, t-tests, and the chi-square test.
- Bayesian concepts like Bayes' theorem and how to apply it in R.
- How to use R and RCommander for exploring and visualizing data and performing statistical analyses.
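Several of the concepts listed above take only a line or two of R; the data here is simulated and the Bayes figures are assumed purely for illustration.

    x <- rnorm(100, mean = 10, sd = 2)
    mean(x); median(x); sd(x); var(x); range(x)            # central tendency and spread

    t.test(x, mu = 10)                                     # one-sample t-test
    chisq.test(table(sample(c("A", "B"), 100, TRUE)))      # chi-square goodness-of-fit test

    # Bayes' theorem: P(condition | positive test) with assumed rates
    prior <- 0.01; sens <- 0.95; spec <- 0.90
    (prior * sens) / (prior * sens + (1 - prior) * (1 - spec))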
R by example: mining Twitter for consumer attitudes towards airlines - Jeffrey Breen
This document describes analyzing sentiment towards airlines on Twitter. It searches Twitter for mentions of airlines, collects the tweets, scores the sentiment of each tweet using a simple word counting algorithm, and summarizes the results for each airline. It then compares the Twitter sentiment scores to customer satisfaction scores from the American Customer Satisfaction Index. A linear regression shows a relationship between the Twitter and ACSI scores, suggesting Twitter sentiment analysis can provide insights into customer satisfaction.
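The word-counting scorer is simple enough to sketch in a few lines. The tiny lexicon below stands in for the opinion word lists used in the talk, and this is not the talk's exact score.sentiment() function.

    pos.words <- c("great", "love", "smooth", "helpful")     # toy positive lexicon
    neg.words <- c("delay", "lost", "cancelled", "terrible") # toy negative lexicon

    score.tweet <- function(text) {
      words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))
      sum(words %in% pos.words) - sum(words %in% neg.words)
    }

    tweets <- c("Great crew and a smooth flight",
                "Terrible delay and they lost my bag")
    sapply(tweets, score.tweet)    # positive for the first tweet, negative for the second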
This document from the FAA provides a forecast for aviation activity from 2011 to 2031. It predicts substantial growth, with passenger numbers increasing by 560 million and revenue passenger miles more than doubling by 2031. Air traffic operations such as tower operations and aircraft handled are also expected to rise significantly. However, there are risks to the forecast like higher than expected energy prices, a weaker economy producing lower demand, infrastructure constraints at congested airports, increased airline consolidation leading to higher fares, and potential reductions in demand due to climate change. The forecast represents a continued recovery from the impacts of the recession, but more modest growth compared to past recoveries.
The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
This document provides an agenda for a presentation on big data and big data analytics using R. The presentation introduces the presenter and has sections on defining big data, discussing tools for storing and analyzing big data in R like HDFS and MongoDB, and presenting case studies analyzing social network and customer data using R and Hadoop. The presentation also covers challenges of big data analytics, existing case studies using tools like SAP Hana and Revolution Analytics, and concerns around privacy with large-scale data analysis.
This document provides an introduction to Hadoop, an open-source distributed processing framework. It describes Hadoop as a set of projects with a common goal of processing large datasets in a distributed manner. The key components of Hadoop are HDFS for distributed storage and MapReduce for distributed computing. A Hadoop cluster consists of a master node and multiple slave nodes, with HDFS and MapReduce masters coordinating the storage and processing across the nodes. The document also outlines the Hadoop ecosystem of related projects and gives examples of how MapReduce works and how Hadoop can be used for various types of analysis.
The document provides statistics on the amount of data generated and shared on various digital platforms each day: over 1 terabyte of data from NYSE, 144.8 billion emails sent, 340 million tweets, 684,000 pieces of content shared on Facebook, 72 hours of new video uploaded to YouTube per minute, and more. It outlines the massive scale of data creation and sharing occurring across social media, financial, and other digital platforms.
This presentation provides an overview of big data concepts and Hadoop technologies. It discusses what big data is and why it is important for businesses to gain insights from massive data. The key Hadoop technologies explained include HDFS for distributed storage, MapReduce for distributed processing, and various tools that run on top of Hadoop like Hive, Pig, HBase, HCatalog, ZooKeeper and Sqoop. Popular Hadoop SQL databases like Impala, Presto and Stinger are also compared in terms of their performance and capabilities. The document discusses options for deploying Hadoop on-premise or in the cloud and how to integrate Microsoft BI tools with Hadoop for big data analytics.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
The Nuts and Bolts of Hadoop and its Ever-changing Ecosystem, Presented by J... - NashvilleTechCouncil
The document discusses an introduction to Hadoop and the Hadoop ecosystem. It begins with definitions of what Hadoop is, including that it is an open-source software framework for distributed storage and processing of large datasets using MapReduce. It then discusses components of Hadoop like HDFS and MapReduce. The document also covers what Hadoop is not intended for. It provides examples of using MapReduce with Python and Pig Latin. Finally, it discusses the broader Hadoop ecosystem and offers tips for getting started with Hadoop.
This document discusses building big data solutions using Microsoft's HDInsight platform. It provides an overview of big data and Hadoop concepts like MapReduce, HDFS, Hive and Pig. It also describes HDInsight and how it can be used to run Hadoop clusters on Azure. The document concludes by discussing some challenges with Hadoop and the broader ecosystem of technologies for big data beyond just Hadoop.
How to use hadoop and r for big data parallel processingBryan Downing
This document provides an overview of Hadoop and how R can be used with Hadoop. It describes what Hadoop is, how it uses MapReduce for parallel processing of big data across clusters, and some key R libraries like rmr, rhadoop, and RHive that allow R code to integrate with Hadoop. It also gives examples of using R for word count, sentiment analysis, and querying data with Hive. Potential use cases for traders are discussed and resources for learning more about Hadoop and these technologies are provided.
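Querying Hive from R with RHive, as mentioned above, reduces to a handful of calls; the host, table and column names below are placeholders.

    library(RHive)
    rhive.connect(host = "hive-server-host")
    delays <- rhive.query("SELECT carrier, AVG(dep_delay) AS avg_delay
                             FROM flights GROUP BY carrier")
    head(delays)
    rhive.close()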
Szehon Ho gave a presentation on big data technologies at a Meetup in Paris in July 2017. He discussed his background working with big data in Silicon Valley and his current role leading the analytic data storage team at Criteo in Paris. He provided overviews of Hadoop file systems, MapReduce execution, Hive as an interface for accessing Hadoop, and new technologies like Spark and Hive on Spark.
OpenSource Big Data Platform - Flamingo Project - BYOUNG GON KIM
The document discusses Open Cloud Engine's Flamingo project, an open source big data platform. It provides an overview of Flamingo's features like a file system browser, workflow designer, and monitoring capabilities. These allow users to easily manage and analyze data on Hadoop using a graphical interface rather than writing code. The platform is designed to simplify common big data tasks and support reusable analytic modules.
Hadoop ecosystem framework n hadoop in live environment - Delhi/NCR HUG
The document provides an overview of the Hadoop ecosystem and how several large companies such as Google, Yahoo, Facebook, and others use Hadoop in production. It discusses the key components of Hadoop including HDFS, MapReduce, HBase, Pig, Hive, Zookeeper and others. It also summarizes some of the large-scale usage of Hadoop at these companies for applications such as web indexing, analytics, search, recommendations, and processing massive amounts of data.
10 concepts the enterprise decision maker needs to understand about Hadoop - Donald Miner
Way too many enterprise decision makers have clouded and uninformed views of how Hadoop works and what it does. Donald Miner offers high-level observations about Hadoop technologies and explains how Hadoop can shift the paradigms inside of an organization, based on his report Hadoop: What You Need To Know—Hadoop Basics for the Enterprise Decision Maker, forthcoming from O’Reilly Media.
After a basic introduction to Hadoop and the Hadoop ecosystem, Donald outlines 10 basic concepts you need to understand to master Hadoop:
Hadoop masks being a distributed system: what it means for Hadoop to abstract away the details of distributed systems and why that’s a good thing
Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
Hadoop runs on commodity hardware: an honest definition of commodity hardware and why this is a good thing for enterprises
Hadoop handles unstructured data: why Hadoop is better for unstructured data than other data systems from a storage and computation perspective
In Hadoop, you load data first and ask questions later: the differences between schema-on-read and schema-on-write and the drawbacks this represents
Hadoop is open source: what it really means for Hadoop to be open source from a practical perspective, not just a “feel good” perspective
HDFS stores the data but has some major limitations: an overview of HDFS (replication, not being able to edit files, and the NameNode)
YARN controls everything going on and is mostly behind the scenes: an overview of YARN and the pitfalls of sharing resources in a distributed environment and the capacity scheduler
MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
The Hadoop ecosystem is constantly growing and evolving: an overview of current tools such as Spark and Kafka and a glimpse of some things on the horizon
Big Data and NoSQL for Database and BI Pros - Andrew Brust
This document provides an agenda and overview for a conference session on Big Data and NoSQL for database and BI professionals held from April 10-12 in Chicago, IL. The session will include an overview of big data and NoSQL technologies, then deeper dives into Hadoop, NoSQL databases like HBase, and tools like Hive, Pig, and Sqoop. There will also be demos of technologies like HDInsight, Elastic MapReduce, Impala, and running MapReduce jobs.
Architecting the Future of Big Data and Search - Hortonworks
The document discusses the potential for integrating Apache Lucene and Apache Hadoop technologies. It covers their histories and current uses, as well as opportunities and challenges around making them work better together through tighter integration or code sharing. Developers and businesses are interested in ways to improve searching large amounts of data stored using Hadoop technologies.
Similar to Running R on Hadoop - CHUG - 20120815
Kinetica is a patented, in-memory, columnar, distributed, GPU-accelerated database. It was originally developed for the US Army to identify terrorist threats in real-time by ingesting and analyzing over 200 sources of streaming data including social media, drones, and cyber data. Kinetica can ingest 200 billion new records per hour and provides real-time, actionable intelligence. It leverages GPUs for high performance and can simultaneously ingest and analyze large volumes of data at scale in real-time.
This document provides an overview of deep learning on GPUs. It discusses how GPUs are well-suited for deep learning and other computationally intensive tasks due to their massively parallel architecture. The document then describes what deep learning is, including different types of neural networks commonly used. It also discusses how deep learning can enhance analytics and big data by automating feature extraction. Examples of running deep learning on Spark clusters using frameworks like TensorFlow on Spark are presented.
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.
Jim Scott, CHUG co-founder and Director, Enterprise Strategy and Architecture for MapR presents "Using Apache Drill". This presentation was given on August 13th, 2014 at the Nokia office in Chicago, IL.
Jim has held positions running Operations, Engineering, Architecture and QA teams. He has worked in the Consumer Packaged Goods, Digital Advertising, Digital Mapping, Chemical and Pharmaceutical industries. His work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop.
Apache Drill brings the power of standard ANSI:SQL 2003 to your desktop and your clusters. It is like AWK for Hadoop. Drill supports querying schemaless systems like HBase, Cassandra and MongoDB. Use standard JDBC and ODBC APIs to use Drill from your custom applications. Leveraging an efficient columnar storage format, an optimistic execution engine and a cache-conscious memory layout, Apache Drill is blazing fast. Coordination, query planning, optimization, scheduling, and execution are all distributed throughout nodes in a system to maximize parallelization. This presentation contains live demonstrations.
The video can be found here: http://vimeo.com/chug/using-apache-drill
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic... - Chicago Hadoop Users Group
John Leach, Co-Founder and CTO of Splice Machine, with 15+ years of software development and machine learning experience, will discuss how to use HBase co-processors to build an ANSI-99 SQL database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation and 3) consistent secondary indexing.
Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.
In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop that was traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out with commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle.
HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint transactions are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs’ OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement of traditional RDBMS solutions.
To view the accompanying slide deck: http://www.slideshare.net/ChicagoHUG/
This document summarizes a presentation about using Apache Spark for various data analytics use cases. It discusses how Spark can be used for interactive SQL queries on large datasets, log file enrichment by connecting to data stores like HBase, mixing SQL and machine learning by accessing training and query engines in the same platform, and building recommendation engines by performing ETL, training models with MLlib, and serving recommendations with NoSQL. The presentation argues that Spark helps flatten the adoption curve by providing a unified framework for all these tasks.
This document discusses choosing the right data architecture for big data projects. It begins by acknowledging big data comes in many types, from structured transactional data to unstructured text data. It then presents several big data architectures and platforms that are suitable for different data types and use cases, such as relational databases, NoSQL databases, data grids, and distributed file systems. The document emphasizes that one size does not fit all and the right choice depends on the specific data and business needs.
This document provides an overview of Apache Ambari, an open source framework for provisioning, managing and monitoring Hadoop clusters. It discusses Ambari's architecture and features for provisioning clusters, managing services, monitoring metrics and alerts, and extensibility through Ambari stacks, views and blueprints. The document also outlines Ambari's release cadence and upcoming features around operations, extensibility and troubleshooting insights.
This document discusses security challenges related to big data and Hadoop. It notes that as data grows exponentially, the complexity of managing, securing, and enforcing privacy restrictions on data sets increases. Organizations now need to control data scientists' access based on authorization levels and what data they are allowed to see. Mismanagement of data sets can be costly, as shown by incidents at AOL, Netflix, and a Massachusetts hospital that led to lawsuits and fines. The document then provides a brief history of Hadoop security, noting that it was originally developed without security in mind. It outlines the current Kerberos-centric security model and discusses some vendor solutions emerging to enhance Hadoop security. Finally, it provides guidance on developing security and privacy policies.
The document provides an introduction to MapReduce, including:
- MapReduce is a framework for executing parallel algorithms across large datasets using commodity computers. It is based on map and reduce functions.
- Mappers process input key-value pairs in parallel; their outputs are sorted, grouped by key, and passed to the reducers.
- Examples demonstrate how MapReduce can be used for tasks like building indexes, joins, and iterative algorithms.
This document discusses advanced features and use cases of Oozie including:
1. JMS notifications for job status and SLA notifications
2. Overriding the default launcher for actions like Pig to add custom logic
3. Unit testing Oozie workflows using MiniOozie without a live cluster
Dean Wampler presents on using Scalding, which leverages Cascading, to write MapReduce jobs in a more productive way. Cascading provides higher-level abstractions for building data pipelines and hides much of the boilerplate of the Hadoop MapReduce framework. It allows expressing jobs using concepts like joins and group-bys in a cleaner way focused on the algorithm rather than infrastructure details. Word count is shown implemented in the lower-level MapReduce API versus in Cascading Java code to demonstrate how Cascading minimizes boilerplate and exposes the right abstractions.
Adam Gugliciello, a 15-year veteran in Software Engineering and Systems Architecture, specializes in highly available, parallel systems. In this session, we will answer common questions and demonstrate use cases on how Hadoop and Datameer help with asset management and risk management, fraud detection and data security. This presentation was given on January 24th at the CME Group's offices in Chicago, IL.
Boris Lublinsky and Alexey Yakubovich give us an overview of using Oozie. This presentation was given on December 13th, 2012 at the Nokia offices in Chicago, IL.
View the HD video of this talk here: http://vimeo.com/chug/oozie-overview
Impala is a massively parallel processing SQL query engine for Apache Hadoop. It allows real-time queries on large datasets by using a new execution engine written in C++ instead of Java and MapReduce. Impala can process queries in milliseconds to hours by distributing query execution across Hadoop clusters. It uses existing Hadoop file formats and metadata but is optimized for performance through techniques like runtime code generation and in-memory processing.
HCatalog provides table management capabilities for Hadoop that allow data to be shared across tools like Pig, Hive, and MapReduce. It exposes metadata about tables, partitions, columns and their properties through a metastore. This allows jobs written in different tools to access the structure and location of the data without needing to declare schemas or know file formats. HCatalog aims to simplify data integration and management in Hadoop.
The document introduces MapReduce 2 and YARN, which were designed to address limitations in MapReduce 1. YARN allows for decoupling of MapReduce processing from cluster resource management, enabling better resource utilization and support for additional applications beyond MapReduce. It separates resource management from job scheduling and processing, with a centralized ResourceManager and per-node NodeManagers. This improves scalability and enables high availability. The new architecture also allows for containers with variable resource limits rather than fixed slot types.
This document summarizes AdGooroo's experience deploying Hadoop in a Windows environment. Key points include:
- Hadoop and Windows can integrate but require workarounds like NFS for data transfer between Linux and Windows.
- Tools like Hive and Sqoop worked as expected while others like Flume were overkill.
- Unexpected issues arose with data serialization formats like AVRO not being fully compatible between .NET and Java.
- The learning curve is steep but can be flattened by taking things one component at a time.
The document discusses Apache Avro, a data serialization framework. It provides an overview of Avro's history and capabilities. Key points include that Avro supports schema evolution, multiple languages, and interoperability with other formats like Protobuf and Thrift. The document also covers implementing Avro, including using the generic, specific and reflect data types, and examples of writing and reading data. Performance is also addressed: Avro's serialized size is competitive with the other formats, and its speed ranks in the top half.
Running R on Hadoop - CHUG - 20120815
1. Getting Started with R & Hadoop
Chicago Area Hadoop Users Group
Chicago R Users Group
Boeing Building, Chicago, IL
August 15, 2012
by Jeffrey Breen
President and Co-Founder, Atmosphere Research Group
http://atms.gr/chirhadoop
email: jeffrey@atmosgrp.com
Twitter: @JeffreyBreen
2. Outline
• Why MapReduce? Why R?
• R + Hadoop options
• RHadoop overview
• Step-by-step example
• Advanced RHadoop features
3. Outline
• Why MapReduce? Why R?
• R + Hadoop options
• RHadoop overview
• Step-by-step example
• Advanced RHadoop features
4. Why MapReduce? Why R?
• MapReduce is a programming pattern to aid in the parallel analysis of data
• Popularized, but not invented, by Google
• Named from its two primary steps: a “map” phase which picks out the
identifying and subject data (“key” and “value”) and a “reduce” phase
where the values (grouped by key value) are analyzed (see the sketch at the end of this slide)
• Generally, the programmer/analyst need only write the mapper and
reducer while the system handles the rest
• R is an open source environment for statistical programming and analysis
• Open source and wide platform support makes it easy to try out at
work or “nights and weekends”
• Benefits from an active, growing community
• Offers a (too) large library of add-on packages
(see http://cran.revolutionanalytics.com/)
• Commercial support, extensions, training is available
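To make that two-step pattern concrete, here is a tiny Hadoop-free sketch in plain R (not from the original deck): the "map" step turns each record into a key-value pair, the values are grouped by key, and the "reduce" step summarizes each group.
# Illustrative only: group departure delays (values) by carrier (key), then reduce with mean()
records <- data.frame(carrier = c("UA", "AA", "UA"), delay = c(-7, 5, 12))
grouped <- split(records$delay, records$carrier)   # "map" + shuffle: values grouped by key
sapply(grouped, mean)                              # "reduce": one summary value per key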
6. I was wrong about MapReduce
• When the Google paper was published in 2004, I was
running a typical enterprise IT department
• Big hardware (Sun, EMC) + big applications (Siebel,
Peoplesoft) + big databases (Oracle, SQL Server)
= big licensing/maintenance bills
• Loved the scalability, COTS components, and price, but
missed the fact that keys (and values) could be compound
& complex
Source: Hadoop: The Definitive Guide, Second Edition, p. 20
7. And I was wrong about R
• In 1990, my boss (an astronomer) encouraged
me to learn S or S+
• But I knew C, so I resisted, just as I had
successfully fended off the FORTRAN-pushing
physicists
• 20 years later, it’s my go-to tool for anything
data-related
• I rediscovered it when we were looking for a
way to automate the analysis and delivery of
our consumer survey data at Yankee Group
8. Number of R Packages Available
How many R packages are there now?
At the command line, enter:
> dim(available.packages())
Slide courtesy of John Versotek, organizer of the Boston Predictive Analytics Meetup
9. Outline
• Why MapReduce? Why R?
• R + Hadoop options
• RHadoop overview
• Step-by-step example
• Advanced RHadoop features
10. R + Hadoop options
• Hadoop streaming enables the creation of
mappers, reducers, combiners, etc. in languages
other than Java
• Any language which can handle standard, text-
based input & output will do
• R is designed at its heart to deal with data and
statistics making it a natural match for Big Data-
driven analytics
• As a result, there are a number of R packages to
work with Hadoop
11. There’s never just one R package to do anything...
Package / latest release (as of 2012-07-09) / comments:
• hive v0.1-15 (2012-06-22): misleading name: stands for "Hadoop interactIVE" & has nothing to do with Hadoop hive. On CRAN.
• HadoopStreaming v0.2 (2010-04-22): focused on utility functions: I/O parsing, data conversions, etc. Available on CRAN.
• RHIPE v0.69 (“11 days ago”): comprehensive: code & submit jobs, access HDFS, etc. Unfortunately, most links to it are broken. Look on github instead: https://github.com/saptarshiguha/RHIPE/
• segue v0.04 (2012-06-05): JD Long’s very clever way to use Amazon EMR with small or no data. http://code.google.com/p/segue/
• RHadoop (rmr, rhdfs, rhbase): rmr 1.2.2 (“3 months ago”), rhdfs 1.0.3 (“2 months ago”), rhbase 1.0.4 (“3 months ago”). Divided into separate packages by purpose:
  • rmr - all MapReduce-related functions
  • rhdfs - management of Hadoop’s HDFS file system
  • rhbase - access to HBase database
Sponsored by Revolution Analytics & on github: https://github.com/RevolutionAnalytics/RHadoop
12. Any more?
• Yeah, probably. My apologies to the authors of any
relevant packages I may have overlooked.
• R is nothing if it’s not flexible when it comes to
consuming data from other systems
• You could just use R to analyze the output of
any MapReduce workflow
• R can connect via ODBC and/or JDBC, so you
could connect to Hive as if it were just another
database
• So... how to pick?
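On the ODBC/JDBC route mentioned above: here is a minimal sketch (not from the deck) of querying Hive from R with the RJDBC package. The driver class, jar path, host, and table name are assumptions that depend on your Hive version and installation, and the Hive JDBC driver usually needs several additional jars on its classpath.
library(RJDBC)
# Assumed locations and connection details for an older HiveServer -- adjust for your cluster:
drv  <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
             classPath = "/usr/lib/hive/lib/hive-jdbc.jar")
conn <- dbConnect(drv, "jdbc:hive://localhost:10000/default")
dbGetQuery(conn, "SELECT uniquecarrier, avg(depdelay) FROM airline GROUP BY uniquecarrier")
dbDisconnect(conn)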
14. Thanks, Jonathan Seidman
• While Big Data big wig at Orbitz, local hero
Jonathan (now at Cloudera) published
sample code to perform the same analysis
of the airline on-time data set using Hadoop
streaming, RHIPE, hive, and RHadoop’s rmr
https://github.com/jseidman/hadoop-R
• To be honest, I only had to glance at each
sample to make my decision, but let’s take a
look at the code he wrote for each package
15. About the data & Jonathan’s analysis
• Each month, the US DOT publishes details of the on-time performance
(or lack thereof) for every domestic flight in the country
• The ASA’s 2009 Data Expo poster session was based on a cleaned
version spanning 1987-2008, and thus was born the famous “airline” data
set:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,
FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,
Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,
WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2004,1,12,1,623,630,901,915,UA,462,N805UA,98,105,80,-14,-7,ORD,CLT,599,7,11,0,,0,0,0,0,0,0
2004,1,13,2,621,630,911,915,UA,462,N851UA,110,105,78,-4,-9,ORD,CLT,599,16,16,0,,0,0,0,0,0,0
2004,1,14,3,633,630,920,915,UA,462,N436UA,107,105,88,5,3,ORD,CLT,599,4,15,0,,0,0,0,0,0,0
2004,1,15,4,627,630,859,915,UA,462,N828UA,92,105,78,-16,-3,ORD,CLT,599,4,10,0,,0,0,0,0,0,0
2004,1,16,5,635,630,918,915,UA,462,N831UA,103,105,87,3,5,ORD,CLT,599,3,13,0,,0,0,0,0,0,0
[...]
http://stat-computing.org/dataexpo/2009/the-data.html
• Jonathan’s analysis determines the mean departure delay (“DepDelay”)
for each airline for each month
16. “naked” streaming
hadoop-R/airline/src/deptdelay_by_month/R/streaming/map.R
#! /usr/bin/env Rscript
# For each record in airline dataset, output a new record consisting of
# "CARRIER|YEAR|MONTH t DEPARTURE_DELAY"
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
fields <- unlist(strsplit(line, ","))
# Skip header lines and bad records:
if (!(identical(fields[[1]], "Year")) & length(fields) == 29) {
deptDelay <- fields[[16]]
# Skip records where departure delay is "NA":
if (!(identical(deptDelay, "NA"))) {
# field[9] is carrier, field[1] is year, field[2] is month:
cat(paste(fields[[9]], "|", fields[[1]], "|", fields[[2]], sep=""),
"\t",
deptDelay, "\n")
}
}
}
close(con)
17. “naked” streaming 2/2
hadoop-R/airline/src/deptdelay_by_month/R/streaming/reduce.R
#!/usr/bin/env Rscript
# For each input key, output a record composed of
# YEAR \t MONTH \t RECORD_COUNT \t AIRLINE \t AVG_DEPT_DELAY
con <- file("stdin", open = "r")
delays <- numeric(0) # vector of departure delays
lastKey <- ""
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
split <- unlist(strsplit(line, "\t"))
key <- split[[1]]
deptDelay <- as.numeric(split[[2]])
# Start of a new key, so output results for previous key:
if (!(identical(lastKey, "")) & (!(identical(lastKey, key)))) {
keySplit <- unlist(strsplit(lastKey, "\\|"))
cat(keySplit[[2]], "\t", keySplit[[3]], "\t", length(delays), "\t", keySplit[[1]], "\t", (mean(delays)), "\n")
lastKey <- key
delays <- c(deptDelay)
} else { # Still working on same key so append dept delay value to vector:
lastKey <- key
delays <- c(delays, deptDelay)
}
}
# We're done, output last record:
keySplit <- unlist(strsplit(lastKey, "\\|"))
cat(keySplit[[2]], "\t", keySplit[[3]], "\t", length(delays), "\t", keySplit[[1]], "\t", (mean(delays)), "\n")
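The deck doesn't transcribe the command that launches these two scripts, but a hedged sketch of submitting them as a Hadoop streaming job from within R (using system() and the HADOOP_STREAMING jar referenced later on the prerequisites slide) might look like this; the HDFS paths are the ones used throughout the examples, and exact flags vary by Hadoop version:
# Assumes map.R and reduce.R are executable and in the working directory,
# and that HADOOP_STREAMING points at your hadoop-streaming-<version>.jar
streaming.jar <- Sys.getenv("HADOOP_STREAMING")
cmd <- paste("hadoop jar", streaming.jar,
             "-input /data/airline/",
             "-output /dept-delay-month",
             "-mapper map.R", "-reducer reduce.R",
             "-file map.R", "-file reduce.R")
system(cmd)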
18. hive
hadoop-R/airline/src/deptdelay_by_month/R/hive/hive.R
#! /usr/bin/env Rscript
mapper <- function() {
# For each record in airline dataset, output a new record consisting of
# "CARRIER|YEAR|MONTH t DEPARTURE_DELAY"
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
fields <- unlist(strsplit(line, ","))
# Skip header lines and bad records:
if (!(identical(fields[[1]], "Year")) & length(fields) == 29) {
deptDelay <- fields[[16]]
# Skip records where departure delay is "NA":
if (!(identical(deptDelay, "NA"))) {
# field[9] is carrier, field[1] is year, field[2] is month:
cat(paste(fields[[9]], "|", fields[[1]], "|", fields[[2]], sep=""), "\t",
deptDelay, "\n")
}
}
}
close(con)
}
reducer <- function() {
con <- file("stdin", open = "r")
delays <- numeric(0) # vector of departure delays
lastKey <- ""
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
split <- unlist(strsplit(line, "\t"))
key <- split[[1]]
deptDelay <- as.numeric(split[[2]])
# Start of a new key, so output results for previous key:
if (!(identical(lastKey, "")) & (!(identical(lastKey, key)))) {
keySplit <- unlist(strsplit(lastKey, "\\|"))
cat(keySplit[[2]], "\t", keySplit[[3]], "\t", length(delays), "\t", keySplit[[1]], "\t", (mean(delays)), "\n")
lastKey <- key
delays <- c(deptDelay)
} else { # Still working on same key so append dept delay value to vector:
lastKey <- key
delays <- c(delays, deptDelay)
}
}
# We're done, output last record:
keySplit <- unlist(strsplit(lastKey, "\\|"))
cat(keySplit[[2]], "\t", keySplit[[3]], "\t", length(delays), "\t", keySplit[[1]], "\t", (mean(delays)), "\n")
}
library(hive)
DFS_dir_remove("/dept-delay-month", recursive = TRUE, henv = hive())
hive_stream(mapper = mapper, reducer = reducer,
input="/data/airline/", output="/dept-delay-month")
results <- DFS_read_lines("/dept-delay-month/part-r-00000", henv = hive())
19. RHIPE
hadoop-R/airline/src/deptdelay_by_month/R/rhipe/rhipe.R
#! /usr/bin/env Rscript
# Calculate average departure delays by year and month for each airline in the
# airline data set (http://stat-computing.org/dataexpo/2009/the-data.html)
library(Rhipe)
rhinit(TRUE, TRUE)
# Output from map is:
# "CARRIER|YEAR|MONTH t DEPARTURE_DELAY"
map <- expression({
# For each input record, parse out required fields and output new record:
extractDeptDelays = function(line) {
fields <- unlist(strsplit(line, ","))
# Skip header lines and bad records:
if (!(identical(fields[[1]], "Year")) & length(fields) == 29) {
deptDelay <- fields[[16]]
# Skip records where departure delay is "NA":
if (!(identical(deptDelay, "NA"))) {
# field[9] is carrier, field[1] is year, field[2] is month:
rhcollect(paste(fields[[9]], "|", fields[[1]], "|", fields[[2]], sep=""),
deptDelay)
}
}
}
# Process each record in map input:
lapply(map.values, extractDeptDelays)
})
# Output from reduce is:
# YEAR \t MONTH \t RECORD_COUNT \t AIRLINE \t AVG_DEPT_DELAY
reduce <- expression(
pre = {
delays <- numeric(0)
},
reduce = {
# Depending on size of input, reduce will get called multiple times
# for each key, so accumulate intermediate values in delays vector:
delays <- c(delays, as.numeric(reduce.values))
},
post = {
# Process all the intermediate values for key:
keySplit <- unlist(strsplit(reduce.key, "\\|"))
count <- length(delays)
avg <- mean(delays)
rhcollect(keySplit[[2]],
paste(keySplit[[3]], count, keySplit[[1]], avg, sep="\t"))
}
)
inputPath <- "/data/airline/"
outputPath <- "/dept-delay-month"
# Create job object:
z <- rhmr(map=map, reduce=reduce,
ifolder=inputPath, ofolder=outputPath,
inout=c('text', 'text'), jobname='Avg Departure Delay By Month',
mapred=list(mapred.reduce.tasks=2))
# Run it:
rhex(z)
20. rmr (1.1)
hadoop-R/airline/src/deptdelay_by_month/R/rmr/deptdelay-rmr.R
#!/usr/bin/env Rscript
# Calculate average departure delays by year and month for each airline in the
# airline data set (http://stat-computing.org/dataexpo/2009/the-data.html).
# Requires rmr package (https://github.com/RevolutionAnalytics/RHadoop/wiki).
library(rmr)
csvtextinputformat = function(line) keyval(NULL, unlist(strsplit(line, ",")))
deptdelay = function (input, output) {
mapreduce(input = input,
output = output,
textinputformat = csvtextinputformat,
map = function(k, fields) {
# Skip header lines and bad records:
if (!(identical(fields[[1]], "Year")) & length(fields) == 29) {
deptDelay <- fields[[16]]
# Skip records where departure delay is "NA":
if (!(identical(deptDelay, "NA"))) {
# field[9] is carrier, field[1] is year, field[2] is month:
keyval(c(fields[[9]], fields[[1]], fields[[2]]), deptDelay)
}
}
},
reduce = function(keySplit, vv) {
keyval(keySplit[[2]], c(keySplit[[3]], length(vv), keySplit[[1]], mean(as.numeric(vv))))
})
}
from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))
22. Other rmr advantages
• Well designed API
• Your code only needs to deal with R objects: strings, lists,
vectors & data.frames
• Very flexible I/O subsystem (new in rmr 1.2, faster in 1.3)
• Handles common formats like CSV
• Allows you to control the input parsing line-by-line without
having to interact with stdin/stdout directly (or even loop)
• The result of the primary mapreduce() function is simply the
HDFS path of the job’s output
• Since one job’s output can be the next job’s input, mapreduce
calls can be daisy-chained to build complex workflows
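A minimal sketch of that daisy-chaining (not from the deck), using the same rmr 1.x conventions as the examples above; the grouping by parity is arbitrary and simply illustrates that one job's output path becomes the next job's input:
library(rmr)
step1 <- mapreduce(input = to.dfs(1:100),
                   map   = function(k, v) keyval(v %% 2, v))        # key = parity
step2 <- mapreduce(input = step1,                                   # step1 is just an HDFS path
                   map    = function(k, v) keyval(k, v),            # identity map
                   reduce = function(k, vv) keyval(k, sum(unlist(vv))))
from.dfs(step2)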
23. Outline
• Why MapReduce? Why R?
• R + Hadoop options
• RHadoop overview
• Step-by-step example
• Advanced RHadoop features
24. RHadoop overview
• Modular
• Packages group similar functions
• Only load (and learn!) what you need
• Minimizes prerequisites and dependencies
• Open Source
• Cost: Low (no) barrier to start using
• Transparency: Development, issue tracker, Wiki, etc. hosted on
github: https://github.com/RevolutionAnalytics/RHadoop/
• Supported
• Sponsored by Revolution Analytics
• Training & professional services available
25. RHadoop packages
• rhbase - access to HBase database
• rhdfs - interaction with Hadoop’s HDFS file
system
• rmr - all MapReduce-related functions
26. RHadoop prerequisites
• General
• R 2.13.0+, Revolution R 4.3, 5.0
• Cloudera CDH3 Hadoop distribution
• Detailed answer: https://github.com/RevolutionAnalytics/RHadoop/wiki/Which-Hadoop-for-rmr
• Environment variables (a sketch for setting these from within R follows this slide)
• HADOOP_HOME=/usr/lib/hadoop
• HADOOP_CONF=/etc/hadoop/conf
• HADOOP_CMD=/usr/bin/hadoop
• HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar
• rhdfs
• R package: rJava
• rmr
• R packages: RJSONIO (0.95-0 or later), itertools, digest
• rhbase
• Running Thrift server (and its prerequisites)
• see https://github.com/RevolutionAnalytics/RHadoop/wiki/rhbase
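If those environment variables aren't already set in your shell, one option is to set them for the current session from within R before loading the packages; a minimal sketch, using the CDH3 default paths listed on this slide:
Sys.setenv(HADOOP_HOME = "/usr/lib/hadoop",
           HADOOP_CONF = "/etc/hadoop/conf",
           HADOOP_CMD  = "/usr/bin/hadoop")
# HADOOP_STREAMING must point at the versioned jar on your system, e.g.:
# Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<version>.jar")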
27. Downloading RHadoop
• Stable and development branches are available on github
• https://github.com/RevolutionAnalytics/RHadoop/
• Releases available as packaged “tarballs”
• https://github.com/RevolutionAnalytics/RHadoop/downloads
• Most current as of August 2012
• https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz
• https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.5.tar.gz
• https://github.com/downloads/RevolutionAnalytics/RHadoop/rhbase_1.0.4.tar.gz
• Or pull your own from the master branch
• https://github.com/RevolutionAnalytics/RHadoop/tarball/master
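A hedged sketch of installing the downloaded tarballs from an R session; the CRAN dependencies are the ones listed on the prerequisites slide, and the file names match the August 2012 releases above:
install.packages(c("rJava", "RJSONIO", "itertools", "digest"))           # CRAN prerequisites
install.packages("rmr_1.3.1.tar.gz",    repos = NULL, type = "source")
install.packages("rhdfs_1.0.5.tar.gz",  repos = NULL, type = "source")
install.packages("rhbase_1.0.4.tar.gz", repos = NULL, type = "source")   # also needs a running Thrift server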
28. Primary rmr functions
• Convenience
• keyval() - creates a key-value pair from any two R
objects. Used to generate output from input
formatters, mappers, reducers, etc.
• Input/output
• from.dfs(), to.dfs() - read/write data from/to the HDFS
• make.input.format() - provides common file parsing
(text, CSV) or will wrap a user-supplied function
• Job execution
• mapreduce() - submit job and return an HDFS path to
the results if successful
29. First, an easy example
Let’s harness the power of our Hadoop cluster...
to square some numbers
library(rmr)
small.ints = 1:1000
small.int.path = to.dfs(1:1000)
out = mapreduce(input = small.int.path,
map = function(k,v) keyval(v, v^2)
)
df = from.dfs( out, to.data.frame=T )
see https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial
30. Example output (abridged edition)
> out = mapreduce(input = small.int.path, map = function(k,v) keyval(v, v^2))
12/05/08 10:31:17 INFO mapred.FileInputFormat: Total input paths to process : 1
12/05/08 10:31:18 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-cloudera/
mapred/local]
12/05/08 10:31:18 INFO streaming.StreamJob: Running job: job_201205061032_0107
12/05/08 10:31:18 INFO streaming.StreamJob: To kill this job, run:
12/05/08 10:31:18 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -
Dmapred.job.tracker=ec2-23-22-84-153.compute-1.amazonaws.com:8021 -kill
job_201205061032_0107
12/05/08 10:31:18 INFO streaming.StreamJob: Tracking URL: http://
ec2-23-22-84-153.compute-1.amazonaws.com:50030/jobdetails.jsp?
jobid=job_201205061032_0107
12/05/08 10:31:20 INFO streaming.StreamJob: map 0% reduce 0%
12/05/08 10:31:24 INFO streaming.StreamJob: map 50% reduce 0%
12/05/08 10:31:25 INFO streaming.StreamJob: map 100% reduce 0%
12/05/08 10:31:32 INFO streaming.StreamJob: map 100% reduce 33%
12/05/08 10:31:34 INFO streaming.StreamJob: map 100% reduce 100%
12/05/08 10:31:35 INFO streaming.StreamJob: Job complete: job_201205061032_0107
12/05/08 10:31:35 INFO streaming.StreamJob: Output: /tmp/Rtmpu9IW4I/file744a2b01dd31
> df = from.dfs( out, to.data.frame=T )
> str(df)
'data.frame': 1000 obs. of 2 variables:
$ V1: int 1 2 3 4 5 6 7 8 9 10 ...
$ V2: num 1 4 9 16 25 36 49 64 81 100 ...
see https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial
31. Components of basic rmr jobs
• Process raw input with formatters
• see make.input.format()
• Write mapper function in R to extract relevant
key-value pairs
• Perform calculations and analysis in reducer
function written in R
• Submit the job for execution with mapreduce()
• Fetch the results from HDFS with from.dfs()
32. Outline
• Why MapReduce? Why R?
• R + Hadoop options
• RHadoop overview
• Step-by-step example
• Advanced RHadoop features
33. Using rmr: airline enroute time
• Since Hadoop keys and values needn’t be single-valued, let’s pull out a
few fields from the data: scheduled and actual gate-to-gate times and
actual time in the air keyed on year and airport pair
• To review, here’s what the data for a given day (3/25/2004) and
airport pair (BOS & MIA) might look like:
2004,3,25,4,1445,1437,1820,1812,AA,399,N275AA,215,215,197,8,8,BOS,MIA,1258,6,12,0,,0,0,0,0,0,0
2004,3,25,4,728,730,1043,1037,AA,596,N066AA,195,187,170,6,-2,MIA,BOS,1258,7,18,0,,0,0,0,0,0,0
2004,3,25,4,1333,1335,1651,1653,AA,680,N075AA,198,198,168,-2,-2,MIA,BOS,1258,9,21,0,,0,0,0,0,0,0
2004,3,25,4,1051,1055,1410,1414,AA,836,N494AA,199,199,165,-4,-4,MIA,BOS,1258,4,30,0,,0,0,0,0,0,0
2004,3,25,4,558,600,900,924,AA,989,N073AA,182,204,157,-24,-2,BOS,MIA,1258,11,14,0,,0,0,0,0,0,0
2004,3,25,4,1514,1505,1901,1844,AA,1359,N538AA,227,219,176,17,9,BOS,MIA,1258,15,36,0,,0,0,0,15,0,2
2004,3,25,4,1754,1755,2052,2121,AA,1367,N075AA,178,206,158,-29,-1,BOS,MIA,1258,5,15,0,,0,0,0,0,0,0
2004,3,25,4,810,815,1132,1151,AA,1381,N216AA,202,216,180,-19,-5,BOS,MIA,1258,7,15,0,,0,0,0,0,0,0
2004,3,25,4,1708,1710,2031,2033,AA,1636,N523AA,203,203,173,-2,-2,MIA,BOS,1258,4,26,0,,0,0,0,0,0,0
2004,3,25,4,1150,1157,1445,1524,AA,1901,N066AA,175,207,161,-39,-7,BOS,MIA,1258,4,10,0,,0,0,0,0,0,0
2004,3,25,4,2011,1950,2324,2257,AA,1908,N071AA,193,187,163,27,21,MIA,BOS,1258,4,26,0,,0,0,21,6,0,0
2004,3,25,4,1600,1605,1941,1919,AA,2010,N549AA,221,194,196,22,-5,MIA,BOS,1258,10,15,0,,0,0,0,22,0,0
34. rmr 1.2+ input formatter
• The input formatter is called to parse each input line
• in 1.3, speed can be improved by processing batches of lines, but
the idea’s the same
• Jonathan’s code splits CSV file just fine, but we’re going to get fancy and
name the fields of the resulting vector.
• rmr v1.2+’s make.input.format() can wrap your own function:
asa.csvtextinputformat = make.input.format( format = function(line) {
values = unlist( strsplit(line, ",") )
names(values) = c('Year','Month','DayofMonth','DayOfWeek','DepTime',
'CRSDepTime','ArrTime','CRSArrTime','UniqueCarrier',
'FlightNum','TailNum','ActualElapsedTime','CRSElapsedTime',
'AirTime','ArrDelay','DepDelay','Origin','Dest','Distance',
'TaxiIn','TaxiOut','Cancelled','CancellationCode',
'Diverted','CarrierDelay','WeatherDelay','NASDelay',
'SecurityDelay','LateAircraftDelay')
return( keyval(NULL, values) )
} )
https://raw.github.com/jeffreybreen/tutorial-201203-big-data/master/R/functions.R
36. mapper
Note the improved readability due to named fields and the compound key-value
output:
#
# the mapper gets a key and a value vector generated by the formatter
# in our case, the key is NULL and all the field values come in as a vector
#
mapper.year.market.enroute_time = function(key, val) {
# Skip header lines, cancellations, and diversions:
if ( !identical(as.character(val['Year']), 'Year')
& identical(as.numeric(val['Cancelled']), 0)
& identical(as.numeric(val['Diverted']), 0) ) {
# We don't care about direction of travel, so construct 'market'
# with airports ordered alphabetically
# (e.g., LAX to JFK becomes 'JFK-LAX')
if (val['Origin'] < val['Dest'])
market = paste(val['Origin'], val['Dest'], sep='-')
else
market = paste(val['Dest'], val['Origin'], sep='-')
# key consists of year, market
output.key = c(val['Year'], market)
# output gate-to-gate elapsed times (CRS and actual) + time in air
output.val = c(val['CRSElapsedTime'], val['ActualElapsedTime'], val['AirTime'])
return( keyval(output.key, output.val) )
}
}
https://raw.github.com/jeffreybreen/tutorial-201203-big-data/master/R/functions.R
38. Hadoop then collects mapper output by key
http://blog.stackoverflow.com/wp-content/uploads/then-a-miracle-occurs-cartoon.png
39. reducer
For each key, our reducer is called with a list containing all of its values:
#
# the reducer gets all the values for a given key
# the values (which may be multi-valued as here) come in the form of a list()
#
reducer.year.market.enroute_time = function(key, val.list) {
# val.list is a list of row vectors
# a data.frame is a list of column vectors
# plyr's ldply() is the easiest way to convert IMHO
if ( require(plyr) )
val.df = ldply(val.list, as.numeric)
else { # this is as close as my deficient *apply skills can come w/o plyr
val.list = lapply(val.list, as.numeric)
val.df = data.frame( do.call(rbind, val.list) )
}
colnames(val.df) = c('crs','actual','air') # order matches the mapper's output: CRS, actual, air
output.key = key
output.val = c( nrow(val.df), mean(val.df$actual, na.rm=T),
mean(val.df$crs, na.rm=T),
mean(val.df$air, na.rm=T) )
return( keyval(output.key, output.val) )
}
https://raw.github.com/jeffreybreen/tutorial-201203-big-data/master/R/functions.R
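The slides showing the job submission aren't in this transcript, so here is a hedged sketch of how the formatter, mapper, and reducer above might be wired together with mapreduce(); argument names follow rmr 1.2/1.3 (the 1.1 example earlier used textinputformat instead), and the HDFS paths are assumptions:
library(rmr)
out <- mapreduce(input        = "/data/airline/",
                 output       = "/enroute-time",
                 input.format = asa.csvtextinputformat,
                 map          = mapper.year.market.enroute_time,
                 reduce       = reducer.year.market.enroute_time)
results.df <- from.dfs(out, to.data.frame = TRUE)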
43. Outline
• Why MapReduce? Why R?
• R + Hadoop options
• RHadoop overview
• Step-by-step example
• Advanced RHadoop features
44. rmr’s local backend
• rmr can simulate a Hadoop cluster on your
local machine
• Just set the ‘backend’ option:
rmr.options.set(backend='local')
• Very handy for development and testing
• You can try installing rmr completely
Hadoop-free, but your mileage may vary
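For example, a quick test of the earlier squaring job against the local backend, no cluster required (a sketch using the rmr 1.x calls shown above):
library(rmr)
rmr.options.set(backend = 'local')
out <- mapreduce(input = to.dfs(1:10), map = function(k, v) keyval(v, v^2))
from.dfs(out, to.data.frame = TRUE)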
45. RHadoop packages
• rhbase - access to HBase database
• rhdfs - interaction with Hadoop’s HDFS file
system
• rmr - all MapReduce-related functions
46. rhbase function overview
• Initialization
• hb.init()
• Create and manage tables
• hb.list.tables(), hb.describe.table()
• hb.new.table(), hb.delete.table()
• Read and write data
• hb.insert(), hb.insert.data.frame()
• hb.get(), hb.get.data.frame(), hb.scan()
• hb.delete()
• Administrative, etc.
• hb.defaults(), hb.set.table.mode()
• hb.regions.table(), hb.compact.table()