This document summarizes a workshop on using Hadoop and MapReduce for scientific data analysis, with an emphasis on matrix computations. The workshop is intended for those interested in MapReduce, Hadoop, and formulating problems as matrix computations. Attendees will learn which problems MapReduce solves effectively, techniques for solving them using Hadoop and dumbo, and basic Hadoop terminology. The workshop will not cover the latest MapReduce algorithms, improving Hadoop job performance, or writing the wordcount example in Hadoop. Sample code and tutorials are available online. The agenda includes HPC vs. data computing, MapReduce vs. Hadoop, Hadoop streaming, and sparse matrix methods with Hadoop.
Processing big data quickly and efficiently with Apache Nemo
- Wonook Song, Youngseok Yang (Software Platform Lab, Department of Computer Science and Engineering, Seoul National University)
Overview
Apache Nemo is a system that optimizes the distributed execution of big-data applications to match diverse resource environments and data characteristics. When handling geo-distributed resources, transient resources, large data shuffles, and skewed data, Apache Nemo substantially outperforms Apache Spark.
Contents
Case studies of Apache Nemo's optimizations
Apache Nemo's distributed execution process
Future research directions
Faunus is a graph analytics engine built atop the Hadoop distributed computing platform. The graph representation is a distributed adjacency list, whereby a vertex and its incident edges are co-located on the same machine. Querying a Faunus graph is possible with a MapReduce variant of the Gremlin graph traversal language. A Gremlin expression compiles down to a series of MapReduce steps that are sequence-optimized and then executed by Hadoop. Results are stored as transformations of the input graph (graph derivations) or as computational side-effects such as aggregates (graph statistics). Beyond querying, a collection of input/output formats is supported, which enables Faunus to load/store graphs in the distributed graph database Titan, in various graph formats on HDFS, and via arbitrary user-defined functions. This presentation will focus primarily on Faunus, but will also review the satellite technologies that enable it.
Introduction to scalable graph analysis with Apache Giraph and Spark GraphX (rhatr)
Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.
Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.
Even if your batch-oriented Hadoop MapReduce approach works reasonably well, scalable graph processing calls for an in-memory, exploratory, and iterative approach. One of the best ways to tame this complexity is the Bulk Synchronous Parallel model. Its two most widely used implementations are available as Hadoop-ecosystem projects: Apache Giraph (used at Facebook) and GraphX (part of the Apache Spark project).
In this talk we will focus on practical advice: how to get up and running with Apache Giraph and GraphX, how to start analyzing simple datasets with the built-in algorithms, and finally how to implement your own graph processing applications using the APIs the projects provide. We will then compare and contrast the two and lay out some principles for when to use one vs. the other.
LocationTech is an Eclipse Foundation industry working group for location-aware technologies. This presentation introduces LocationTech and looks at what it means for our industry and the participating projects.
Libraries: JTS Topology Suite is the rocket science of GIS, providing an implementation of Geometry. Mobile Map Tools provides a C++ foundation that is translated into Java and JavaScript for maps on iOS, Android, and WebGL. GeoMesa is a distributed key/value store based on Accumulo. Spatial4j integrates with JTS to provide Geometry on a curved surface.
Process: GeoTrellis offers real-time distributed processing using Scala, Akka, and Spark. GeoJinni mixes spatial data/indexing with Hadoop.
Applications: GEOFF offers OpenLayers 3 as a SWT component. GeoGit provides distributed revision control for feature data. GeoScript brings spatial data to Groovy, JavaScript, Python, and Scala. uDig offers an Eclipse-based desktop GIS solution.
Attend this presentation if you want to know what LocationTech is about, are interested in these projects, or are curious about which projects will be next.
GAN explained simply (What is this? Gum? It's GAN.) (Hansol Kang)
A review of the original GAN paper and a PyTorch-based implementation.
A comparison of deep-learning development environments and languages.
[References]
Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems. 2014.
Wang, Su. "Generative Adversarial Networks (GAN): A Gentle Introduction."
Understanding Generative Adversarial Networks from a beginning graduate student's perspective (https://jaejunyoo.blogspot.com/)
Mastering GANs (Generative Adversarial Networks) in one hour (https://www.slideshare.net/NaverEngineering/1-gangenerative-adversarial-network)
Framework comparison (https://deeplearning4j.org/kr/compare-dl4j-torch7-pylearn)
The 5 most suitable programming languages for AI development (http://www.itworld.co.kr/news/109189#csidxf9226c7578dd101b41d03bfedfec05e)
What is Git? And what on earth is GitHub? (https://www.slideshare.net/ianychoi/git-github-46020592)
A guide to Git concepts for SVN power users (https://www.slideshare.net/einsub/svn-git-17386752)
This is a slide deck that I have been using to present GeoTrellis at various meetings and workshops. The information speaks to the pre-1.0 release of GeoTrellis in Q4 of 2016.
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters (Xiao Qin)
An increasing number of popular applications are data-intensive in nature. In the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion. In this talk, we introduce Google's MapReduce framework for processing huge datasets on large clusters. We first outline the motivations of the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, both the homogeneity and data-locality assumptions are not satisfied in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes in a way that each node has a balanced data-processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored in each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy can always improve MapReduce performance by rebalancing data across nodes before running a data-intensive application in a heterogeneous Hadoop cluster.
Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
A presentation on Hadoop for scientific researchers given at Universitat Rovira i Virgili in Catalonia, Spain in October 2010. http://etseq.urv.cat/seminaris/seminars/3/
Many experts believe that ageing can be delayed; this is one of the main goals of the Institute of Healthy Ageing at University College London. I will present the results of my lifespan-extension research, where we integrated publicly available gene databases in order to identify ageing-related genes. I will show what challenges we met and what we have learned about the process of ageing.
Ageing is one of the fundamental mysteries in biology, and many scientists are starting to study this fascinating process. I am part of the research group led by Dr Eugene Schuster at the UCL Institute of Healthy Ageing. We experiment with Drosophila and Caenorhabditis elegans by modifying their genes in order to create long-lived mutants. The results of our experiments are quantified using high-throughput microarray analysis. Finally, we apply information technology in order to understand how the ageing process works. I will show how we mine microarray data in order to find the connections between thousands of genes and how we identify candidates for ageing genes.
We are interested in building a better understanding of genes functions by harnessing the large quantity of experimental microarray data in the public databases. Our hope is that after understanding the ageing process in simpler organisms we will be able to apply this knowledge in humans.
Cross-referencing expression levels in thousands of genes and hundreds of experiments turned out to be a computationally challenging problem, but Hadoop and the Amazon cloud came to our rescue. In this talk I will present a case study based on our use of R with Amazon Elastic MapReduce and will give background on our bioinformatics challenges.
These slides were presented at ApacheCon Europe 2012:
http://www.apachecon.eu/schedule/presentation/3/
When two of the most powerful innovations in modern analytics come together, the result is revolutionary.
This presentation covers:
- An overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization.
- The ways that R and Hadoop have been integrated.
- A use case that provides real-world experience.
- A look at how enterprises can take advantage of both of these industry-leading technologies.
Presented at Hadoop World 2011 by:
David Champagne
CTO, Revolution Analytics
David Champagne is a top software architect, programmer, and product manager with over 20 years of experience in enterprise and web application development for business customers across a wide range of industries. As Principal Architect/Engineer for SPSS, Champagne led the development teams and created and led the text mining team.
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, Revolution Analytics (Cloudera, Inc.)
When two of the most powerful innovations in modern analytics come together, the result is revolutionary. This session will provide an overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization. It will discuss the ways that R and Hadoop have been integrated and look at a use case that provides real-world experience. Finally, it will suggest how enterprises can take advantage of both of these industry-leading technologies.
Similar to MapReduce for scientific simulation analysis
Correlation clustering and community detection in graphs and networks (David Gleich)
We show a new relationship between various community detection objectives and a correlation clustering framework. These enable us to detect communities with good bounds on the solution.
Spectral clustering with motifs and higher-order structures (David Gleich)
I presented these slides at the #strathna meeting in Glasgow in June 2017. They are an updated and enhanced version of the earlier talks on the subject.
Higher-order organization of complex networks (David Gleich)
A talk I gave at the Park City Mathematics Institute about our recent work on using motifs to analyze and cluster networks. This involves a higher-order Cheeger inequality in terms of motifs.
Spacey random walks and higher-order data analysis (David Gleich)
My talk at TMA 2016 (the workshop on Tensors, Matrices, and their Applications) on the relationship between a spacey random walk process and tensor eigenvectors.
A copy of my slides from the SILO Seminar at UW Madison on our recent developments for the NEO-K-Means methods, including new optimization routines and results.
Using Local Spectral Methods to Robustify Graph-Based Learning (David Gleich)
This is my KDD2015 talk on robustness in semi-supervised learning. The paper is already on Michael Mahoney's website: http://www.stat.berkeley.edu/~mmahoney/pubs/robustifying-kdd15.pdf See the KDD paper for all the details, which this talk is a bit light on.
Spacey random walks and higher order Markov chains (David Gleich)
My talk at the SIAM NetSci workshop (2015) on our new spacey random walk and spacey random surfer models and how we derived them. There are many potential extensions and opportunities to use this for analyzing big data as tensors.
Localized methods in graph mining exploit the local structures in a graph instead of attempting to find global structures. These are widely successful at all sorts of problems, including community detection and label propagation.
PageRank Centrality of dynamic graph structures (David Gleich)
A talk I gave at the SIAM Annual Meeting mini-symposium on the mathematics of the power grid organized by Mahantesh Halappanavar. I discuss a few ideas on how our dynamic centrality could help analyze such situations.
Big data matrix factorizations and Overlapping community detection in graphs (David Gleich)
In a talk at the Chinese Academy of Sciences Institute of Automation, I discuss some of the MapReduce and community detection methods I've worked on.
Anti-differentiating approximation algorithms: A case study with min-cuts, sp... (David Gleich)
This talk covers the idea of anti-differentiating approximation algorithms, an idea meant to explain the success of widely used heuristic procedures. Formally, this involves finding an optimization problem solved exactly by an approximation algorithm or heuristic.
Localized methods for diffusions in large graphs (David Gleich)
I describe a few ongoing research projects on diffusions in large graphs and how we can create efficient matrix computations to evaluate them.
Anti-differentiating Approximation Algorithms: PageRank and MinCut (David Gleich)
We study how Google's PageRank method relates to mincut and a particular type of electrical flow in a network. We also explain the details of how the "push method" for computing PageRank helps to accelerate it. This has implications for semi-supervised learning and machine learning, as well as social network analysis.
Fast relaxation methods for the matrix exponential (David Gleich)
The matrix exponential is a matrix computation primitive used in link prediction and community detection. We describe a fast method to compute it using relaxation on a large linear system of equations. This enables us to compute a column of the matrix exponential in sublinear time, or under a second on a standard desktop computer.
Fast matrix primitives for ranking, link-prediction and more (David Gleich)
I gave this talk at Netflix about some of the recent work I've been doing on fast matrix primitives for link prediction, and also some non-standard uses of the nuclear norm for ranking.
Gaps between the theory and practice of large-scale matrix-based network comp... (David Gleich)
I discuss some runtimes for the personalized PageRank vector and how it relates to open questions in how we should tackle these network-based measures via matrix computations.
MapReduce Tall-and-skinny QR and applications (David Gleich)
A talk at the SIMONS workshop on Parallel and Distributed Algorithms for Inference and Optimization on how to do tall-and-skinny QR factorizations on MapReduce using a communication-avoiding algorithm.
Recommendation and graph algorithms in Hadoop and SQL (David Gleich)
A talk I gave at ancestry.com on Hadoop, SQL, recommendation, and graph algorithms. It's a tutorial overview; there are better algorithms than those I describe, but these are a simple starting point.
1. A hands-on introduction to scientific data analysis with Hadoop
A matrix computations perspective
DAVID F. GLEICH, PURDUE UNIVERSITY
ICME MAPREDUCE WORKSHOP @ STANFORD
2. Who is this for?
workshop project groups
those curious about "MapReduce" and "Hadoop"
those who think about problems as matrices
3. What should you get out of it?
1. understand some problems that MapReduce solves effectively
2. techniques to solve them using Hadoop and dumbo
3. learn some Hadoop words
4. What you won’t learn …
the latest and greatest in MapReduce algorithms
how to improve the performance of your Hadoop job
how to write wordcount in Hadoop
5. Slides will be online soon.
Code samples and short tutorials at github.com/dgleich/mrmatrix
6. 1. HPC vs. Data (redux)
2. MapReduce vs. Hadoop
3. Dive into Hadoop with Hadoop streaming
4. Sparse matrix methods with Hadoop
10. MapReduce is designed to solve a different set of problems
11. Supercomputer / Data computing cluster / Engineer
Each multi-day HPC simulation generates gigabytes of data. A data cluster can hold hundreds or thousands of old simulations … enabling engineers to query and analyze months of simulation data for all sorts of neat purposes.
13. The MapReduce programming model
Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs
14. The MapReduce programming model
Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs
The map function f must be side-effect free.
The reduce function g must be side-effect free.
15. The MapReduce programming model
Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs
All map functions can be done in parallel.
All reduce functions (for key k) can be done in parallel.
16. The MapReduce programming model
Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Reduce: apply a function g to all values with key k (for all k)
Output: a list of (key, value) pairs
Shuffle: group all pairs with key k together (sorting suffices)
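This model is easy to simulate on a laptop. Below is a minimal sketch of the Input → Map → Shuffle → Reduce pipeline in plain Python (my illustration, not part of the deck); note how sorting by key is all the shuffle needs, exactly as the slide says.

from itertools import groupby
from operator import itemgetter

def mapreduce(pairs, f, g):
    # Map: apply f to all (key, value) pairs; f is side-effect free,
    # so the order of application is irrelevant.
    mapped = [kv for pair in pairs for kv in f(*pair)]
    # Shuffle: group all pairs with key k together (sorting suffices).
    mapped.sort(key=itemgetter(0))
    out = []
    # Reduce: apply g to all values with key k, for every k.
    for key, group in groupby(mapped, key=itemgetter(0)):
        out.extend(g(key, [v for _, v in group]))
    return out

# e.g. summing values per key:
# mapreduce([(1, 2.0), (1, 3.0), (2, 4.0)],
#           lambda k, v: [(k, v)],
#           lambda k, vs: [(k, sum(vs))])  ->  [(1, 5.0), (2, 4.0)]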
17. Mesh point variance in MapReduce
[Figure: three simulation runs (Run 1, Run 2, Run 3), each with mesh data at T=1, T=2, T=3.]
18. Mesh point variance in MapReduce
[Figure: the three runs from slide 17 feed mappers (M), whose output is shuffled to reducers (R).]
1. Each mapper outputs the mesh points with the same key.
2. The shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.
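As a concrete sketch of this job (my reconstruction; the record format is an assumption, not from the slides), the mapper and reducer could be:

import numpy

def mapper(key, value):
    # Assumed record format: one line per sample, "run time mesh_point_id value".
    run, t, point, val = value.split()
    # 1. Output each simulated value keyed by its mesh point.
    yield point, float(val)

def reducer(point, values):
    # 2. The shuffle has already grouped every value for this mesh point.
    # 3. The reducer just computes a numerical variance.
    yield point, numpy.var([float(v) for v in values])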
19. MapReduce vs. Hadoop
MapReduce: a computation model with
Map - a local data transform
Shuffle - a grouping function
Reduce - an aggregation
Hadoop: an implementation of MapReduce using the HDFS parallel file-system.
Others: Phoenix++, Twisted, Google MapReduce, Spark, …
20. Why so many limitations?
21. Data scalability
[Figure: data blocks 1-5, each stored with a map task (M) on the node that holds it; map outputs are shuffled to reducers (R).]
The idea: bring the computations to the data.
MR can schedule map functions without moving data.
22. Mesh point variance in MapReduce
[Figure repeated from slide 18: runs feed mappers (M), which feed reducers (R).]
1. Each mapper outputs the mesh points with the same key.
2. The shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.
Bring the computations to the data!
23. heartbreak on node rs252
After waiting in the queue for a month and after 24 hours of finding eigenvalues, one node randomly hiccups.
24. Fault tolerant
[Figure: input stored in triplicate feeds the maps (M); map output is persisted to disk before the shuffle; reduce (R) input/output is on disk.]
Redundant input helps make maps data-local.
Just one type of communication: shuffle.
25. Fault injection
[Plot: time to completion (sec) vs. 1/Prob(failure), the mean number of successes per failure (10 to 1000), with curves for runs with and without faults on a 200M-by-200 matrix and an 800M-by-10 matrix.]
With 1/5 of tasks failing, the job only takes twice as long.
27. Tools I like
hadoop streaming
dumbo
mrjob
hadoopy
C++
28. Tools I don’t use but other people seem to like …
pig
java
hbase
Eclipse
Cassandra
29. hadoop streaming
the map function is a program: (key,value) pairs are sent via stdin; output (key,value) pairs go to stdout
the reduce function is a program: (key,value) pairs are sent via stdin; keys are grouped; output (key,value) pairs go to stdout
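Because "a program" just means something that reads stdin and writes stdout, a streaming row-sum mapper is ordinary Python. A minimal sketch (mine, assuming the default TextInputFormat, which feeds one line per record, with tab-separated key/value output):

#!/usr/bin/env python
import sys

for i, line in enumerate(sys.stdin):
    # Each input line is one row of the matrix as whitespace-separated numbers.
    vals = [float(v) for v in line.split()]
    # Emit a (key, value) pair as a tab-separated line; the row-id is arbitrary.
    sys.stdout.write('%d\t%.16e\n' % (i, sum(vals)))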
30. dumbo
a wrapper around hadoop streaming for map and reduce functions in python

#!/usr/bin/env dumbo
def mapper(key, value):
    """ Each record is a line of text.
    key = <byte that the line starts in the file>
    value = <line of text>
    """
    valarray = [float(v) for v in value.split()]
    yield key, sum(valarray)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, dumbo.lib.identityreducer)
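As a usage note (mine; the flag names come from the dumbo documentation, so double-check them against your install): a script like this is launched through dumbo's driver, e.g. dumbo start rowsums.py -hadoop $HADOOP_HOME -input mydata -output rowsums, and dumbo assembles the hadoop streaming invocation for you.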
31. Synthetic data test: a 100,000,000-by-500 matrix (~500 GB)
How can Hadoop streaming possibly be fast?
Codes implemented in MapReduce streaming; matrix stored as TypedBytes lists of doubles; Python frameworks use Numpy+Atlas; custom C++ TypedBytes reader/writer with Atlas; a non-streaming Java implementation too. New! Computing the R in a QR factorization of the 500 GB matrix.

          Iter 1 QR (secs.)   Iter 1 Total (secs.)   Iter 2 Total (secs.)   Overall Total (secs.)
Dumbo     67725               960                    217                    1177
Hadoopy   70909               612                    118                    730
C++       15809               350                    37                     387
Java      -                   436                    66                     502

C++ in streaming beats a native Java implementation.
All timing results from the Hadoop job tracker.
32. Demo 1
1. generate data
2. get data to hadoop
3. run row sums
4. see row sums!
33. How does Hadoop know key = byte in file, value = line of text?
InputFormat: maps a file on HDFS to (key,value) pairs
TextInputFormat: maps a text file to (<byte offset>, <line>) pairs
34. The Hadoop Distributed File System (HDFS) and a big text file
HDFS stores files in 64MB chunks. Each chunk is a FileSplit, and FileSplits are stored in parallel. An InputFormat converts FileSplits into a sequence of key-value records. FileSplits can cross record borders (a small bit of communication).
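The record-border point deserves a sketch. The logic below is my paraphrase of what a line-oriented record reader does, not Hadoop's actual Java code: a split finishes any line it started, and skips a partial line at its front because the previous split owns it.

def read_lines_in_split(f, start, end):
    """Yield the lines owned by the byte range [start, end) of file f."""
    f.seek(start)
    if start > 0:
        f.readline()   # a partial first line belongs to the previous split
    while f.tell() < end:
        line = f.readline()
        if not line:
            break
        yield line     # a line that starts before `end` is ours, even if it
                       # finishes after `end` (the small bit of communication)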
35. Tall-and-skinny matrix storage in MapReduce
A : m x n, m ≫ n
[Figure: A partitioned into row blocks A1, A2, A3, A4.]
The key is an arbitrary row-id. The value is the 1 x n array for a row. Each submatrix Ai is an InputSplit (the input to a map task).
36. hadoop vs. MPI
hadoop: output row-sum for all local rows
MPI: parallel load; for my-batch-of-rows: compute row-sum; parallel save
37. Isn’t reading and writing text files rather inefficient?
38. Sequence Files and OutputFormat
SequenceFile: an internal Hadoop file format to store (key, value) pairs efficiently. Used between map and reduce steps.
OutputFormat: maps (key, value) pairs to output on disk.
TextOutputFormat: maps (key, value) pairs to key<tab>value strings.
39. typedbytes
A simple binary serialization scheme.
[<1-byte-type-flag> <binary-value>]*
Roughly equivalent to JSON
(Optionally) used to communicate to and from Hadoop streaming.
40. typedbytes example

def _read(self):
    t = unpack_type(self.file.read(1))[0]
    self.t = t
    return self.handler_table[t](self)

def read_vector(self):
    r = self._read
    count = unpack_int(self.file.read(4))[0]
    return tuple(r() for i in xrange(count))
42. Column sums in dumbo

#!/usr/bin/env dumbo
def mapper(key, value):
    """ Each record is a line of text. """
    valarray = [float(v) for v in value.split()]
    for col, val in enumerate(valarray):
        yield col, val

def reducer(col, values):
    yield col, sum(values)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, reducer)
43. Isn’t this just moving the data to the computation?
Yes. It seems much worse than MPI.
MPI: parallel load; for my-batch-of-rows: update sum of each column; parallel reduce partial column sums; parallel save
44. The MapReduce programming model
Input: a list of (key, value) pairs
Map: apply a function f to all pairs
Combine: apply g to local values with key k
Shuffle: group all pairs with key k together
Reduce: apply a function g to all values with key k
Output: a list of (key, value) pairs
45. Column sums in dumbo

#!/usr/bin/env dumbo
def mapper(key, value):
    """ Each record is a line of text. """
    valarray = [float(v) for v in value.split()]
    for col, val in enumerate(valarray):
        yield col, val

def reducer(col, values):
    yield col, sum(values)

if __name__ == '__main__':
    import dumbo
    import dumbo.lib
    dumbo.run(mapper, reducer, combiner=reducer)
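A caveat worth stating explicitly (mine, not the slide's): reusing the reducer as the combiner is only correct because sum is associative and commutative, so per-mapper partial sums combine into the same final answer. A reducer that, say, averaged its raw input values could not be dropped in as a combiner without changing the result.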
46. How many mappers and reducers?
The number of maps is the number of InputSplits.
You choose how many reducers.
Each reducer outputs to a separate file.
47. Demo 3: column sums with multiple reducers
48. Which reducer does my key go to?
Partitioner: maps a given key to a reducer
HashPartitioner: randomly distributes keys
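Hash partitioning is a one-liner; here is a sketch of the idea in Python (Hadoop's real HashPartitioner is Java, so this is only an illustration):

def partition(key, num_reducers):
    # Equal keys always land on the same reducer; distinct keys land
    # pseudo-randomly, which balances load in expectation.
    return hash(key) % num_reducers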
50. Storing a matrix by rows
[Background, cropped from the source text: if M is the adjacency matrix of a graph, then storing the matrix by columns corresponds to storing the graph as an in-edge list. We briefly illustrate compressed row and column storage schemes in the figure.]
[Figure: a weighted directed graph on six vertices and its weighted adjacency matrix, with nonzeros (1,2)=16, (1,3)=13, (2,3)=10, (2,4)=12, (3,2)=4, (3,5)=14, (4,3)=9, (4,6)=20, (5,4)=7, (5,6)=4.]
Compressed sparse row: rp = 1 3 5 7 9 11 11; ci = 2 3 3 4 2 5 3 6 4 6; ai = 16 13 10 12 4 14 9 20 7 4
Compressed sparse column: cp = 1 1 3 6 8 9 11; ri = 1 3 1 2 4 2 5 3 4 5; ai = 16 4 13 10 9 12 7 14 20 4
Most graph algorithms are designed to work with out-edge lists instead of in-edge lists. Before running an algorithm, MatlabBGL explicitly transposes the graph so that Matlab's internal representation corresponds to storing out-edge lists. For algorithms on symmetric graphs, these transposes are not required. The mex commands mxGetPr, mxGetJc, and mxGetIr retrieve pointers to Matlab's internal storage of the matrix without making a copy.
51. Storing a matrix by rows in a text-file
[Same background, figure, and storage arrays as slide 50, now with each matrix row written out as (column, value) pairs:]
Row 1: (2,16.) (3,13.)
Row 2: (3,10.) (4,12.)
Row 3: (2,4.) (5,14.)
Row 4: (3,9.) (6,20.)
Row 5: (4,7.) (6,4.)
Row 6: (empty)
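To make the text format concrete, here is a hypothetical helper (not from the deck) that writes a scipy.sparse matrix with one line per row, as a row-id followed by (column, value) pairs:

import scipy.sparse

def write_matrix_rows(A, f):
    """Write a sparse matrix as one text line per row: 'i  j1 v1  j2 v2 ...'."""
    A = A.tocsr()   # compressed sparse row gives cheap access to each row
    for i in range(A.shape[0]):
        lo, hi = A.indptr[i], A.indptr[i + 1]
        pairs = ' '.join('%d %.16e' % (j, v)
                         for j, v in zip(A.indices[lo:hi], A.data[lo:hi]))
        f.write('%d %s\n' % (i, pairs))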
52. Sparse matrix-vector product
[Ax]_i = Σ_j A_ij x_j
[Figure: the matrix of slides 50-51, stored by rows, alongside the vector x = (2.1, -1.3, 0.5, 0.6, -1.2, 0.89). The slide reuses Figure 6.1 of the source text: at far left, a weighted directed graph; its weighted adjacency matrix lies below; at right are the compressed row and compressed column arrays for this graph and matrix. Compressed row and column storage make it easy to access entries in rows and columns, respectively: the 3rd entry of rp says to look at the 5th element of ci to find all the columns in the 3rd row of the matrix, and the 5th and 6th elements of ci and ai tell us that row 3 has non-zeros in columns 2 and 5, with values 4 and 14. When the sparse matrix corresponds to the adjacency matrix of a graph, this corresponds to efficient access to the out-edges and in-edges of a vertex. To store an m×n sparse matrix, Matlab uses compressed column format [Gilbert et al.]; Matlab never stores a 0 value in a sparse matrix and always "re-compresses" the data structure in these cases.]
To make this work, we need to get the value of the vector to the same function as the column of the matrix.
53. Sparse matrix-vector product
[Same figure as slide 52.]
We need to "join" the matrix and vector based on the column.
54. Sparse matrix-vector product takes two MR tasks
Task 1 Map (two types of records!): if vector, emit (row, vecval); if matrix, for each non-zero (row,col,val), emit (col,(row,val)).
Task 1 Reduce: find vecval among the input values (one of these values is not like the others); for each (col,(row,val)), emit (row, val*vecval). This forms Aij xj for each nonzero.
Task 2 Map: identity.
Task 2 Reduce: given (row, [(Aij xj), …]), emit (row, sum(Aij xj)). Regroup data by rows, compute sums.
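A dumbo-style sketch of the two tasks (my reconstruction of what the slide describes; the record formats are assumptions):

def map1(key, value):
    # Assumed inputs: vector entries ('x', j, x_j) and matrix nonzeros
    # ('A', i, j, A_ij). Key both by the column j so one reducer sees
    # x_j together with column j of A.
    if value[0] == 'x':
        _, j, xj = value
        yield j, ('x', xj)
    else:
        _, i, j, aij = value
        yield j, ('A', (i, aij))

def reduce1(j, values):
    xj = None
    nonzeros = []
    for tag, payload in values:
        if tag == 'x':
            xj = payload              # the one value not like the others
        else:
            nonzeros.append(payload)  # buffers column j -- see slide 55
    for i, aij in nonzeros:
        yield i, aij * xj             # forms A_ij x_j

def map2(key, value):
    yield key, value                  # identity

def reduce2(i, values):
    yield i, sum(values)              # [Ax]_i = sum over j of A_ij x_j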
55. What about a “dense” row?
Map: if vector, emit (row, vecval); if matrix, for each non-zero (row,col,val), emit (col,(row,val)).
Reduce: find vecval in the input values; for each (col,(row,val)), emit (row, val*vecval). This forms Aij xj for each nonzero.
But how do we find vecval without looking through (and buffering) all the input? (One of these values is not like the others.)
56. Sparse matrix-vector product takes two MR tasks
Map (two types of records!): if vector, emit ((row,-1), vecval); if matrix, for each non-zero (row,col,val), emit ((col,0),(row,val)).
Use a custom partitioner to make sure that (row,*) keys all get mapped to the same reducer, and that we always see (row,-1) before (row,0).
Reduce: find vecval in the input keys; for each (col,(row,val)), emit (row, val*vecval). This forms Aij xj for each nonzero. Then regroup data by rows and compute sums.
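The same trick in sketch form (an illustration of the idea, not Hadoop's partitioner API): partition on the first key component only, while the shuffle sorts on the full composite key, so (col,-1) always precedes (col,0) and nothing needs buffering.

def partition(key, num_reducers):
    col, flag = key                    # flag is -1 for the vector, 0 for the matrix
    return hash(col) % num_reducers    # (col,-1) and (col,0) meet at one reducer

def reduce1(sorted_pairs):
    # sorted_pairs: ((col, flag), value) pairs in full-key order per reducer.
    xj = None
    for (col, flag), value in sorted_pairs:
        if flag == -1:
            xj = value                 # arrives first; no buffering required
        else:
            i, aij = value
            yield i, aij * xj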
60. Algorithm
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows
[Figure: Mapper 1 (serial TSQR) reads blocks A1-A4, repeatedly stacking the current R factor with the next block and re-running qr (producing Q2 R2, Q3 R3, Q4 R4), then emits R4; Mapper 2 does the same for A5-A8 and emits R8; Reducer 1 (serial TSQR) stacks R4 and R8, runs qr, and emits the final Q and R.]
61. In hadoopy
Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
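Why compressing through R factors is legitimate (my note, not a slide): if A is a stack of row blocks, factoring the stack of the blocks' R factors reproduces the R of the whole matrix, up to the sign of each row. A quick numpy check:

import numpy

A = numpy.random.randn(1000, 10)
R1 = numpy.linalg.qr(A[:500], 'r')
R2 = numpy.linalg.qr(A[500:], 'r')
R_tsqr = numpy.linalg.qr(numpy.vstack([R1, R2]), 'r')
R_direct = numpy.linalg.qr(A, 'r')
# R is unique only up to the sign of each row, so compare magnitudes.
print(numpy.allclose(numpy.abs(R_tsqr), numpy.abs(R_direct)))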
62. Related resources
Apache Mahout: machine learning for Hadoop … lots of matrices there …
Another fantastic tutorial: http://www.eurecom.fr/~michiard/teaching/webtech/tutorial.pdf
63. Way too much stuff!
I hope to keep expanding this tutorial over the week…
Keep checking the git repo.