The document discusses parallel computing for econometric analysis using Amazon Web Services. It introduces Hadoop and MapReduce algorithms for distributed processing. It then provides a simple example of using R and Elastic MapReduce on AWS to run regressions in parallel on simulated data and aggregate the results. Various AWS services like EC2, S3, and EMR are described for flexible and scalable cloud computing.
6. Table of Contents
Tools Overview
Hadoop
Amazon Web Services
A Simple EMR and R Example
The R code - mapper
Resources List
segue and a SML Example
Simulated Maximum Likelihood Example
multicore - on the way to segue
diving into segue
Other EC2 Software Options
Conclusion
8. Algorithms and Implementations
“Stupidly parallel” - e.g. a for loop where each iteration is independent.
Only 1 computer? (need 1-8 cores) - use the R multicore package on a single EC2 node (a minimal sketch follows this list).
Need more? Use Hadoop / MapReduce - it can do complicated mapping and aggregation, in addition to the stupidly parallel stuff.
MapReduce options: use Hadoop directly (Java), Hadoop Streaming (any programming language), or the rhipe R package (R on Hadoop).
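A minimal sketch of the single-machine route, assuming nothing beyond the multicore package itself (on current R, library(parallel) provides the same mclapply):

library(multicore)  # library(parallel) on modern R; mclapply is unchanged
# Eight independent simulations, one per core - "stupidly parallel" because
# each iteration depends only on its own seed.
results <- mclapply(1:8, function(seed) {
  set.seed(seed)
  mean(rnorm(1e6))  # stand-in for a real per-iteration computation
})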
9. In this presentation, we will be using Hadoop either directly through Elastic MapReduce or indirectly via the Segue package for R.
10. Alternatives
Wait a long time
Use multicore, e.g. http://www.rforge.net/doc/packages/multicore/mclapply.html
Take over the computer lab and start jobs by hand
Buy your own cluster (huge initial cost, and it will be unutilized most of the time)
12. What is it?
Hadoop is made by the Apache Software Foundation, which makes open source software. Contributors to the foundation are both large companies and individuals.
Hadoop Common: the common utilities that support the other Hadoop subprojects.
HDFS: a distributed file system that provides high-throughput access to application data.
MapReduce: a software framework for distributed processing of large data sets on compute clusters.
Often, when people say “Hadoop” they mean Hadoop's implementation of the MapReduce algorithm, which was created by Google and documented here: http://labs.google.com/papers/mapreduce.html
13. What is it for?
Used to process many TB of webserver logs for metrics, targeted ad placement, etc.
Users include:
Google - calculating PageRank, processing traffic, etc.
Yahoo - more than 100,000 CPUs in various clusters, including one 4,000-node cluster. Used for ad placement, etc.
LinkedIn - huge social network graphs - “you may know...”
Amazon - creating product search indices
See: http://wiki.apache.org/hadoop/PoweredBy
15. Algorithm
The idea is that the job is broken into map and reduce steps: the mapper processes input and creates chunks, and the reducer aggregates the chunks (a toy illustration of the two steps follows below).
Hadoop provides a Java implementation of this algorithm. Features include fault tolerance, adding nodes on the fly, extreme speed, and more.
Hadoop itself is implemented in Java, but Hadoop Streaming allows mappers and reducers in any language, communicating over <STDIN> and <STDOUT>.
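To make the two steps concrete, here is a toy word count in plain R - not Hadoop code, just the same map-then-reduce shape on a single machine:

lines <- c("the cat sat", "the dog sat")
# map: turn each input record into intermediate values - here, individual words
mapped <- unlist(lapply(lines, function(l) strsplit(l, " ")[[1]]))
# reduce: group the intermediate values by key and aggregate - here, count them
reduced <- table(mapped)
print(reduced)  # cat: 1, dog: 1, sat: 2, the: 2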
18. What is this cloud?
Cloud computing is the idea of abstracting away from hardware.
All data and computing resources are managed services.
Pay per hour, based on need.
19. AWS Overview
Get ready for some acronyms! Amazon Web Services (AWS) is full of them. The relevant ones are:
EC2 - Elastic Compute Cloud - dynamically get N computers for a few cents per hour. Machines range from micro instances ($0.02/hr) to 8-core, 70GB-RAM “quad-xl” instances ($2.00/hr) to GPU machines ($2.10/hr).
EMR - Elastic MapReduce - automates the instantiation of Hadoop jobs. Builds the cluster and runs the job, completely in the background.
S3 - Simple Storage Service - store VERY large objects in the cloud.
RDS - Relational Database Service - a managed MySQL database. An easy way to store data and later load it into R with the RMySQL package, e.g.
select date, price from myTable where TICKER = 'AMZN'
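A hedged sketch of that last workflow (the RDS endpoint, credentials, and table name are all placeholders):

library(RMySQL)
# connect to the managed MySQL instance; every connection detail here is made up
con <- dbConnect(MySQL(), host = "mydb.abc123.us-east-1.rds.amazonaws.com",
                 user = "analyst", password = "secret", dbname = "prices")
amzn <- dbGetQuery(con, "select date, price from myTable where TICKER = 'AMZN'")
dbDisconnect(con)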
22. Steps
1. Write the mapper in R. The output will be aggregated by Hadoop's aggregate function.
2. Create input files.
3. Upload all to S3.
4. Configure the EMR job in the AWS Management Console.
5. Done!
23. Files
The directory emr.simpleExample/simpleSimRmapper contains the following:
makeData.R - generates 1000 csv files with 1,000,000 rows and 4 columns each. Each file is about 76 MB.
fileSplit.sh - takes a directory of input files and prepares them for use with EMR (more on this later).
sjb.simpleMapper.R - takes the name of a file from the command line, gets it from S3, runs a regression, and hands back the coefficients. These coefficients are then aggregated using aggregate, a standard Hadoop reducer.
25. Mapper functions
INPUT: <STDIN>. This can be:
a seed to a random number generator
raw data text to process
a list of file names to process - we are doing this one.
OUTPUT: <STDOUT> (print it!), which next goes to the reducer.
26. General R Mapper Code Outline

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)

  # process and print results
}
close(con)
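Because Hadoop Streaming only ever sees <STDIN> and <STDOUT>, a mapper in this shape can be tested locally before touching a cluster, e.g. cat fileList.txt | Rscript mapper.R (where mapper.R and fileList.txt are hypothetical names for the script above and a sample input file).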
27. Simple Mapper
file: sjb.simpleMapper.R. Algorithm:
get the file from S3
read it
run the regression
print the results in a way that aggregate can read
(a hedged reconstruction follows below)
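The actual sjb.simpleMapper.R ships with the author's code; purely as a hedged reconstruction (the S3 fetch command, column names, and regression formula are all assumptions), the four steps might look like:

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  fname <- trimWhiteSpace(line)                        # each input line names one S3 object
  system(paste("hadoop fs -get", fname, "local.csv"))  # get the file from S3 (s3cmd would also work)
  d <- read.csv("local.csv")                           # read it
  fit <- lm(y ~ x1 + x2 + x3, data = d)                # run the regression (assumed columns)
  # print one key/value line per coefficient, in the form the aggregate reducer sums
  for (v in names(coef(fit))) {
    cat(sprintf("DoubleValueSum:%s\t%.10f\n", v, coef(fit)[v]))
  }
}
close(con)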
. . . . . .
29. Overview
1. Made some data with makeData.R.
2. Used fileSplit.sh to make lists of files to grab from S3.
These lists are fed into the mapper. Then transferred the
data and the lists to S3. See moveToS3.sh for the list of
commands, but don't try to run it directly.
3. sjb.simpleMapper.R reads lines. Each line is a file name.
It opens the file, does some work, and prints some output.
4. Configured the job on EMR using the AWS Management Console,
using the standard aggregator to aggregate the results.
. . . . . .
30. Numbers
Consider this: in less than 10 minutes, we
Instantiated a cluster of 13 m2.xlarge instances (68.4 GB RAM,
8 cores each)
Installed the Linux OS and Hadoop software on all nodes
Distributed approx. 20 GB of data to the nodes
Ran some analysis in R
Aggregated the results
Shut down the cluster
. . . . . .
32. Useful Links
Good EMR R Discussion
Hadoop on EMR with C# and F#
Hadoop Aggregate
. . . . . .
34. Description
From the project website:
"Segue has a simple goal: Parallel functionality in R; two
lines of code; in under 15 minutes." - J.D. Long
segue homepage: http://code.google.com/p/segue/
. . . . . .
35. AWS API - the segue underlying
API stands for Application Programming Interface.
All Amazon Web Services have APIs, which allow
programmatic access. These expose many more features than
the AWS Management Console.
For example, through the API one can start and stop a cluster
without adding jobs, add nodes to a running cluster, etc.
Using the API, you can write programs that treat clusters as
native objects.
segue is such a program.
. . . . . .
36. segue usage
Segue is ideal for CPU-bound applications, e.g. simulations.
It replaces lapply, which applies a function to the elements of
a list, with emrlapply, which distributes the evaluation of the
function to a cluster via Elastic MapReduce.
The list can be anything: seeds for a random number
generator, matrices to invert, data frames to analyse, etc.
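In sketch form, assuming credentials are already set and using the
cluster functions introduced on the next slides:
library(segue)
myCluster <- createCluster(numInstances = 2)
results <- emrlapply(myCluster, as.list(1:10), function(x) x^2)
stopCluster(myCluster)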
. . . . . .
38. code overview
Note: the code is available on my website, http://econsteve.com/r.
We show three levels of optimization:
for loops to matrices
evaluating firms on multiple cores
evaluating firms on multiple computers on EC2
. . . . . .
39. Simulated MLE
We use the simulated log-likelihood

\ln \hat{L}_{NR} = \sum_{i=1}^{N} \ln \left[ \frac{1}{R} \sum_{r=1}^{R} \prod_{t=1}^{T_i} h(y_{it} \mid x_{it}, \theta, u_i^r) \right]

where i indexes a person among N people, or a firm in a set of N
firms, R is the number of simulation draws (with R ∝ √N), and
T_i is the length of the data for firm i.
. . . . . .
40. With for loops - R pseudocode
panelLogLik.simple <- function(THETA, dataList, seedMatrix) {
  logLik <- 0
  uir <- qnorm(seedMatrix)
  for (n in 1:N) {
    LiR <- 0
    for (r in 1:R) {
      myProduct <- 1
      alpha.r <- mu.a + uir[r, (2 * n) - 1] * sigma.a
      beta.r  <- mu.b + uir[r, (2 * n)]     * sigma.b
      for (t in 1:T) {
        # fi = density of the residual, using Y and THETA
        myProduct <- myProduct * fi
      }
      LiR <- LiR + myProduct
    } # end for r in R
    Li <- LiR / R
    logLik <- logLik + log(Li)
  } # end for n
  return(logLik)
}
. . . . . .
41. With for loops - R pseudocode
We then maximize the likelihood function as:
optimRes <- optim(THETA.init1, panelLogLik.simple,
This is extremely slow on one processor, and does not lend itself
to parallelization (30 minutes for 60 firms; we didn't bother to
test more).
. . . . . .
42. Opt 1 - matrices, lists, lapply
We adopt a new approach with the following rules:
Structure the data as a list of lists, where each sublist contains
the data, the ticker symbol, and the uir draws for the relevant
coefficients (a sketch follows this list)
Make a firm-level (i ∈ N) likelihood function, and an outer panel
likelihood function which sums the results over the firms
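A hedged sketch of that list of lists, using the field names that
firmLikelihood expects on the next slide (tickers, Y, X, and uir
are assumed to exist already):
dataList <- lapply(1:N, function(n) {
  list(TICKER   = tickers[n],                         # ticker symbol
       DATA     = data.frame(Y = Y[, n], X = X[, n]), # the firm's panel
       UIRALPHA = uir[, (2 * n) - 1],                 # draws for alpha
       UIRBETA  = uir[, (2 * n)])                     # draws for beta
})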
. . . . . .
43. Opt 1 - matrices, lists, lapply - firm likelihood
# this should be an extremely fast firm likelihood function
firmLikelihood <- function(dataListItem, THETA, R) {
  sigma.e <- THETA[1]; mu.a <- THETA[2]; sigma.a <- THETA[3]
  mu.b <- THETA[4]; sigma.b <- THETA[5]
  data.n <- dataListItem$DATA; X.n <- data.n$X; Y.n <- data.n$Y
  T <- nrow(data.n)
  uirAlpha <- dataListItem$UIRALPHA
  uirBeta  <- dataListItem$UIRBETA
  alpha.rmat <- mu.a + uirAlpha * sigma.a
  beta.rmat  <- mu.b + uirBeta  * sigma.b
  YtStack <- repmat(Y.n, R, 1)
  XtStack <- repmat(X.n, R, 1)
  residMat <- YtStack - alpha.rmat - XtStack * beta.rmat
  fitMat <- (1 / (sigma.e * sqrt(2 * pi))) *
              exp(-(residMat^2) / (2 * sigma.e^2))
  myProductVec <- apply(fitMat, 1, prod)
  Li2 <- sum(myProductVec) / R
  return(Li2)
}
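Note that repmat is not a base R function. A common one-line
definition, borrowed from the Matlab idiom, is:
# tile matrix a in an n-by-m block pattern, like Matlab's repmat
repmat <- function(a, n, m) kronecker(matrix(1, n, m), a)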
. . . . . .
44. The list-based outer loop
panelLogLik.faster <- function(THETA, dataList, seedMatrix) {
  # the seed matrix has R rows and 2*N columns, where there are
  # N firms and 2 parameters of interest (alpha and beta)
  uir <- qnorm(seedMatrix)
  R <- nrow(seedMatrix)
  # notice that we can calculate the likelihoods independently for
  # each firm, so we can make a function and use lapply. This will
  # be useful for parallelization
  firmLik <- lapply(dataList, firmLikelihood, THETA, R)
  logLik <- sum(log(unlist(firmLik)))
  return(logLik)
}
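As before, the outer function is handed to optim. A hedged sketch
(optim minimizes by default, so we flip the sign via fnscale):
optimRes <- optim(THETA.init1, panelLogLik.faster,
                  dataList = dataList, seedMatrix = seedMatrix,
                  control = list(fnscale = -1))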
. . . . . .
46. The list-based outer loop - multicore
Use the R multicore library, and replace lapply with mclapply at
the outer loop:
library(multicore)
...
firmLik <- mclapply(dataList, firmLikelihood, THETA, R)
This will lead to substantial speedups.
. . . . . .
47. multicore
N: 200, R: 150, T: 80, logLik: -34951.8. On a 4-core laptop:
> proc.time()
    user   system  elapsed
 389.180   36.960  125.674
N: 1000, R: 320, T: 80, logLik: -174621.9. On an EC2 2XL:
> proc.time()
    user   system  elapsed
 2705.77  2686.08   417.74
N: 5000, R: 710, T: 80, logLik: -870744.4:
> proc.time()
      user    system   elapsed
 16206.480 16067.150  2768.588
multicore can provide quick and easy parallelization. Write the
program so that the parallel part is an operation on a list, then
replace lapply with mclapply.
. . . . . .
50. multicore is nice for optimizing a local job.
Most machines today have at least 2 cores, and many have 4 or 8.
However, that is still only one machine. Let's use n of them →
. . . . . .
52. installing segue
Install the prerequisite packages rJava and caTools. On Ubuntu
Linux:
sudo apt-get install r-cran-rjava r-cran-catools
Then, download and install segue
http://code.google.com/p/segue/
. . . . . .
53. Using segue
Now in R we do:
> library(segue)
Since we will be using our AWS account, we need to set credentials
so that other people can't launch clusters in our name.
To get the credentials, go to http://aws.amazon.com/account/ and
click "Security Credentials".
Then, back in R:
setCredentials("ABC123", "REALLY+LONG+12312312+STRING+456456")
. . . . . .
54. Firing up the cluster in segue
Use the createCluster command:
createCluster(numInstances = 2, cranPackages, filesOnNodes,
  rObjectsOnNodes, enableDebugging = FALSE, instancesPerNode,
  masterInstanceType = "m1.small", slaveInstanceType = "m1.small",
  location = "us-east-1a", ec2KeyName, copy.image = FALSE,
  otherBootstrapActions, sourcePackagesToInstall)
In our case, let's fire up 10 m2.4xlarge instances. This gives us
80 cores and 684 GB of RAM to play with.
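A call matching that signature might look like this (the instance
count and types are our choices, not segue defaults):
myCluster <- createCluster(numInstances = 10,
                           masterInstanceType = "m2.4xlarge",
                           slaveInstanceType  = "m2.4xlarge")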
. . . . . .
55. parallel random number generation
> myList <- NULL
> set.seed(1)
> for (i in 1:10) {
    a <- c(rnorm(999), NA)
    myList[[i]] <- a
  }
> outputLocal <- lapply(myList, mean, na.rm = T)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm = T)
> all.equal(outputEmr, outputLocal)
[1] TRUE
segue handles this for you. This is very important for simulation.
. . . . . .
56. Monte Carlo π estimation
estimatePi <- function(seed) {
  set.seed(seed)
  numDraws <- 1e6
  r <- .5  # radius... in case the unit circle is too boring
  x <- runif(numDraws, min = -r, max = r)
  y <- runif(numDraws, min = -r, max = r)
  inCircle <- ifelse((x^2 + y^2)^.5 < r, 1, 0)
  return(sum(inCircle) / length(inCircle) * 4)
}
seedList <- as.list(1:100)
require(segue)
myEstimates <- emrlapply(myCluster, seedList, estimatePi)
myPi <- Reduce(sum, myEstimates) / length(myEstimates)
> format(myPi, digits = 10)
[1] "3.14166556"
. . . . . .
57. parallel MLE
Using the code from sml.segue.R on my website. It is exactly the
same as the multicore example, but with the addition of two lines
to start the cluster, as sketched below.
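A hedged sketch of the change (the cluster size is an assumption):
library(segue)
myCluster <- createCluster(numInstances = 10)  # the two added lines
# ... and emrlapply replaces mclapply in the outer loop:
firmLik <- emrlapply(myCluster, dataList, firmLikelihood, THETA, R)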
. . . . . .
59. EC2 has GPUs
Cluster GPU Quadruple Extra Large Instance:
22 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core
Nehalem architecture)
2 x NVIDIA Tesla "Fermi" M2050 GPUs
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cg1.4xlarge
The Fermi chips matter because they have ECC memory, so
simulations are accurate. They are much more robust than gamer
GPUs and cost about $2,800 per card; each machine has two. You can
use one for $2.10 per hour.
. . . . . .
60. RHIPE
RHIPE = R and Hadoop Integrated Processing Environment
http://www.stat.purdue.edu/~sguha/rhipe/
Implements the rhlapply function
Exposes much more of Hadoop's underlying functionality,
including HDFS, the Hadoop Distributed File System
May be better suited for large-data applications
. . . . . .
61. StarCluster I
Allows instantiation of generic clusters on EC2
Use MPI (Message Passing Interface) for much more
complicated parallel programs, e.g. holding one giant matrix
across the RAM of several nodes
From their page:
Simple configuration with sensible defaults
Single "start" command to automatically launch and
configure one or more clusters on EC2
Support for attaching and NFS-sharing Amazon Elastic Block
Storage (EBS) volumes for persistent storage across a cluster
Comes with a publicly available Amazon Machine Image
(AMI) configured for scientific computing
The AMI includes OpenMPI, ATLAS, LAPACK, NumPy, SciPy, and
other useful libraries
. . . . . .
62. StarCluster II
Clusters are automatically configured with NFS, the Sun Grid
Engine queuing system, and password-less ssh between
machines
Supports user-contributed "plugins" that allow users to
perform additional setup routines on the cluster after
StarCluster's defaults
http://web.mit.edu/stardev/cluster/
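For a sense of the workflow, launching, using, and shutting down a
cluster is roughly (the cluster name is assumed):
$ starcluster start mycluster      # boot and configure the nodes
$ starcluster sshmaster mycluster  # log in to the master node
$ starcluster terminate mycluster  # shut everything down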
. . . . . .
63. Matlab
You can do it in theory, but you need either a license manager
or the Matlab Compiler
Either way, it will cost you
Whitepaper from Mathworks:
http://www.mathworks.com/programs/techkits/ec2_paper.html
You may be able to coax EMR into running a compiled Matlab
script, but you would have to bootstrap each machine with the
libraries required to run compiled Matlab applications
Mathworks has no incentive to support this behaviour
Requires toolboxes ($$$)
. . . . . .
65. EC2 and Hadoop are Extremely Powerful
Huge and active community behind both Hadoop (Apache)
and EC2 (Amazon).
EC2, and AWS in general, let you change the way you think
about computing resources: as a service rather than as
devices to manage.
New AWS features are being added all the time.
. . . . . .
66. AWS in Education
AMAZON WILL GIVE YOU MONEY
Researchers - send them your proposal, they send you credits,
and you thank them in the paper.
Teachers - if you are teaching a class, each student gets a $100
credit, good for one year. This would be great for teaching
econometrics, since you can provide a machine image with the
software and data already in place.
Additionally, use AWS for your backups (S3) and other tech needs.
. . . . . .
67. Resources
My website, http://www.econsteve.com/r, for the code in this
presentation
AWS Management Console: http://aws.amazon.com/console/
AWS Blog: http://aws.typepad.com
AWS in Education: http://aws.amazon.com/education/
. . . . . .