Map Reduce



• What is MapReduce?
• What are MapReduce implementations?

Faced with these questions, I did some personal research and wrote up a synthesis, which helped me clarify some ideas. The attached presentation is not intended to be exhaustive, but it may offer some useful insights.

Published in: Business, Technology


  1. MapReduce
     Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
     April 2012
  2. What is MapReduce?
     Restricted parallel programming model meant for large clusters
       – User implements Map() and Reduce() functions
     Parallel computing framework
       – Libraries take care of EVERYTHING else
         • Parallelization
         • Fault Tolerance
         • Data Distribution
         • Load Balancing
     Useful model for many practical tasks
  3. Map and Reduce
     The idea of Map and Reduce is 40+ years old
       – Present in all Functional Programming Languages.
       – See, e.g., APL, Lisp and ML
     Alternate name for Map: Apply-All
     Higher Order Functions
       – take function definitions as arguments, or
       – return a function as output
     Map and Reduce are higher-order functions.
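
To make the higher-order-function idea on this slide concrete, here is a minimal Python sketch. It uses the built-in `map` and `functools.reduce`; the helper names `square`, `add`, and `compose` are hypothetical and exist only for illustration.

```python
from functools import reduce


def square(x):
    return x * x


def add(a, b):
    return a + b


numbers = [1, 2, 3, 4]

# map takes a function as an argument and applies it to every element.
squares = list(map(square, numbers))   # [1, 4, 9, 16]

# reduce also takes a function and folds the values into a single result.
total = reduce(add, squares, 0)        # 30


def compose(f, g):
    """A higher-order function that returns a new function as its output."""
    return lambda x: f(g(x))


square_then_negate = compose(lambda x: -x, square)
print(squares, total, square_then_negate(3))
```

Both kinds of higher-order function from the slide appear here: `map` and `reduce` take functions as arguments, while `compose` returns one.
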
  4. Map and Reduce Functions
     Functions borrowed from functional programming languages (e.g. Lisp)
     Map()
       – Process a key/value pair to generate intermediate key/value pairs
     Reduce()
       – Merge all intermediate values associated with the same key
  5. Example: Counting Words
     Map()
       – Input <filename, file text>
       – Parses file and emits <word, count> pairs
         • e.g. <"hello", 1>
     Reduce()
       – Sums all values for the same key and emits <word, TotalCount>
         • e.g. <"hello", (3 5 2 7)> => <"hello", 17>
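
The word-count example on this slide could look roughly like the following in Python. This is a single-process sketch, not how any particular framework exposes its API: `map_word_count`, `reduce_word_count`, and the small driver `run_word_count` are hypothetical names, and the driver stands in for the grouping that a real MapReduce library would perform across machines.

```python
from collections import defaultdict


def map_word_count(filename, text):
    """Map(): emit a (word, 1) pair for every word in the file text."""
    for word in text.lower().split():
        yield (word, 1)


def reduce_word_count(word, counts):
    """Reduce(): sum all counts collected for one word."""
    return (word, sum(counts))


def run_word_count(files):
    """Hypothetical driver: group intermediate pairs by key, then reduce."""
    grouped = defaultdict(list)
    for filename, text in files.items():
        for word, count in map_word_count(filename, text):
            grouped[word].append(count)
    return dict(reduce_word_count(w, c) for w, c in grouped.items())


files = {
    "a.txt": "hello world hello",
    "b.txt": "hello MapReduce",
}
result = run_word_count(files)
print(result)  # {'hello': 3, 'world': 1, 'mapreduce': 1}
```

Calling `reduce_word_count("hello", [3, 5, 2, 7])` reproduces the slide's example and returns `("hello", 17)`.
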
  6. Execution on Clusters
       1. Input files split (M splits)
       2. Assign Master & Workers
       3. Map tasks
       4. Writing intermediate data to disk (R regions)
       5. Intermediate data read & sort
       6. Reduce tasks
       7. Return
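
The execution steps on this slide can be imitated in a single Python process. The sketch below is an illustration under simplifying assumptions: all function names are hypothetical, the master/worker assignment (step 2) is replaced by plain function calls, in-memory lists stand in for intermediate files on disk, and a toy deterministic hash (the sum of the key's code points) chooses the region.

```python
from itertools import groupby
from operator import itemgetter


def split_input(records, m):
    """Step 1: cut the input into M splits (round-robin here)."""
    return [records[i::m] for i in range(m)]


def run_map_task(split, map_fn, r):
    """Steps 3-4: run map_fn over one split, bucket the output into R regions."""
    regions = [[] for _ in range(r)]
    for record in split:
        for key, value in map_fn(record):
            # Toy deterministic hash: sum of the key's code points, modulo R.
            regions[sum(map(ord, key)) % r].append((key, value))
    return regions


def run_reduce_task(region_index, all_regions, reduce_fn):
    """Steps 5-6: read one region from every map task, sort by key, reduce."""
    pairs = sorted(
        (pair for regions in all_regions for pair in regions[region_index]),
        key=itemgetter(0),
    )
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def run_job(records, map_fn, reduce_fn, m=3, r=2):
    splits = split_input(records, m)
    intermediate = [run_map_task(s, map_fn, r) for s in splits]
    # Step 7: return one output list per reduce task.
    return [run_reduce_task(i, intermediate, reduce_fn) for i in range(r)]


lines = ["a b a", "b c", "a c c", "d"]
outputs = run_job(
    lines,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: (key, sum(values)),
)
print(outputs)  # [[('b', 2), ('d', 1)], [('a', 3), ('c', 3)]]
```

Sorting before grouping matters: `itertools.groupby` only merges adjacent equal keys, which mirrors why the reduce side sorts its intermediate data before calling Reduce().
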
  7. Map/Reduce Cluster Implementation
     [Diagram: Input files (split 0 to split 4) → M map tasks → Intermediate files → R reduce tasks → Output files (Output 0, Output 1)]
     Several map or reduce tasks can run on a single computer
     Each intermediate file is divided into R partitions, by partitioning function
     Each reduce task corresponds to one partition
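
The slide does not say which partitioning function is used. A common choice is a hash of the key modulo R, which is what the sketch below assumes. The function name `partition` is hypothetical, and CRC32 from Python's `zlib` module is picked only because it gives the same value in every process.

```python
import zlib


def partition(key, r):
    """Assign a key to one of R partitions using a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % r


R = 4
keys = ["hello", "world", "map", "reduce", "hello"]
assignments = [(key, partition(key, R)) for key in keys]

# Group the distinct keys by the reduce task that would receive them.
by_reduce_task = {index: set() for index in range(R)}
for key, index in assignments:
    by_reduce_task[index].add(key)

for index, assigned in sorted(by_reduce_task.items()):
    print(f"reduce task {index}: {sorted(assigned)}")
```

Because the hash depends only on the key, every map task sends a given key to the same partition, so exactly one reduce task sees all of that key's values. Python's built-in `hash()` is avoided here because string hashing is randomized per process by default, so separate workers would disagree on the partition.
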
  8. Map Reduce vs. Parallel Databases
     Map Reduce widely used for parallel processing
       – Google, Yahoo, and hundreds of other companies
       – Example uses: compute PageRank, build keyword indices, do data analysis of web click logs, …
     Database people say:
       – But parallel databases have been doing this for decades
     Map Reduce people say:
       – We operate at scales of thousands of machines
       – We handle failures seamlessly
       – We allow procedural code in map and reduce and allow data of any type
  9. Typical MapReduce Cluster
  10. Map Reduce Implementations
      Google
        – Not available outside Google
      Hadoop
        – An open-source implementation in Java
        – Uses HDFS for stable storage
        – Download:
      Teradata Aster
        – Cluster-optimized SQL Database that also implements MapReduce
          • IITB alumnus among founders
      And several others, such as Cassandra at Facebook, etc.
  11. MapReduce v. Hadoop
                             | MapReduce | Hadoop
      Org                    | Google    | Yahoo/Apache
      Impl                   | C++       | Java
      Distributed File Sys   | GFS       | HDFS
      Data Base              | Bigtable  | HBase
      Distributed lock mgr   | Chubby    | ZooKeeper
  12. Solutions Stack for Teradata Aster
      Aster Data Ecosystem
        – Data Integration / ETL
        – Business Intelligence Tools
        – Query Tools
        – Analytics Specialists
        – Systems Management
        – Security
      Aster Data nCluster
      Aster Data Platform Infrastructure
        – Operating System
        – Servers
        – Cloud Infrastructure
        – Storage
  13. Teradata Aster Platform Infrastructure
      For physical infrastructure (non-cloud) deployments
      Aster Data Analytic Platform
        – nCluster: Aster Data nCluster packaged software
        – Operating System: Certified Linux operating system
        – Server Hardware: Certified commodity (x86) server hardware with internal storage
  14. Teradata Aster Infrastructure
      For cloud deployments
      Aster Data Analytic Platform
        – nCluster: Aster Data nCluster packaged software
        – Operating System: Linux operating system
        – Compute Instance (CC, xLarge): Compute instance from cloud provider (e.g. Amazon Web Services EC2)
        – Storage (EBS, Ephemeral): Storage connected to cloud computing capacity
  15. Teradata Aster Architecture for Analytics
      Your Analytics & Advanced Reporting Applications (App, App, App, App)
        • Support for in-database processing of custom applications written in broad variety of languages
        • Integration with third-party packaged software via ODBC/JDBC or in-database integration
      Aster Data nCluster
      Analytic Functions and Frameworks
        • Rich libraries of MapReduce analytics from Aster Data and partners
        • Visual development environment: develop in hours
      Unified Interface (SQL, SQL-MapReduce)
        • Standard SQL interface
        • MapReduce processing integrated with SQL via SQL-MapReduce interface
      Analytics Processing Engines (SQL, MapReduce)
        • Optimized SQL engine
        • Fully-integrated in-database MapReduce
      Massively Parallel Data Stores …
        • Hybrid row/column DBMS
        • Linear, incremental scalability
        • Commodity hardware
  16. Teradata Aster Ecosystem
      Partner | Product | Product release | Platform for Certification
      MicroStrategy | Intelligence Server | 9.2.1 32-bit | Windows 7, Enterprise Edition SP1, 32-bit, 64-bit
      SAP | Business Objects | XI 3.1 | Windows 2008, 32-bit
      Informatica | Powercenter | 9.0.1 | Client: Windows 2003/2008 Server 32 bit. Server: Windows 2003/2008 Server 32 bit and 64 bit
      IBM | Cognos | 10.1FP1 | n/a
      Tableau | Tableau Server | 6 | Windows (SS: TBU)
      Microsoft | SSLS, SSAS, SSFS, SSIS | SQL Server 2008 .NET Framework 2.0 | Windows Server, 2008 64-bit; Windows 2003, 32-bit
      *Oracle BIEE certification currently in process