SlideShare a Scribd company logo
About Me - Connor Zanin
● Boston
● Programming in Go ~1 year
● Working with MR since 2014
● MSc from Northeastern
● https://connorzanin.com
● https://github.com/cnnrznn
● https://github.com/gomr
1
GoMR: A MapReduce Framework
for Golang
Connor Zanin
Outline
● Problem
● MapReduce Background
● GoMR
● Evaluation
● GoMR on K8s
Ask Questions!
3
The Problem
● Class on large-scale data processing (MR)
● Implement triangle count on Hadoop
MapReduce
○ 2 JOIN ops
○ Large intermediate tables
5
The Problem
● Configurations
● Error messages
● Performance
8
Background:
MapReduce
● Invention
● Components
● Example
9
What is MapReduce?
Jeff Dean, Sanjay Ghemawat @ Google
MapReduce: Simplified data processing on large clusters
● “Hundreds of special-purpose computations that
process large amounts of raw data”
● Simple computations
● Complex distribution + fault tolerance code
10
What is MapReduce?
“MapReduce is a programming model and
implementation for processing … big data sets with a
parallel, distributed algorithm on a cluster”
- Wikipedia
11
MapReduce - Definitions
● Map: Transform input elements to
<Key, Value> pairs
● Reduce: Transform all Values per Key
12
MapReduce – Word Count
[cat, dog, cat]
[(cat, 1), (dog, 1), (cat, 1)]
x => (x, 1) Map
Reduce
𝑘
𝑣
13
[(cat, 2), (dog, 1)]
MapReduce - Pipeline
14
Input Map Reduce Output
x => (x, 1) ∑
[cat, dog, cat]
[(cat, 2),
(dog, 1)]
MapReduce - Parallel Pipeline
Input
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Output
15
MapReduce - Parallel Pipeline
16
Cat
Cat
Dog
(Cat,1)
(Dog,1)
(Cat,1)
MapReduce - Parallel Pipeline
Input
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Output
17
?
MapReduce - Definitions
● Map: Transform input elements to
<Key, Value> pairs
● Reduce: Transform all Values per Key
● Partition: Group <Key, Value> pairs by Key
18
MapReduce - Parallel Pipeline
19
Cat
Cat
Dog
(Cat, 1)
(Dog, 1)
(Cat, 1)
(Cat, 2)
GoMR
● Goals
● Design
● Implementation
21
GoMR - Goal
● Minimal configuration
● Simple error messages
● Memory, CPU efficient
22
Let’s
GoMR - Pipeline
23
Goroutines
Channels
Input
Mapper
Mapper
Reducer
Reducer
Output
Partitioner
Partitioner
GoMR - Interfaces
type Mapper interface {
Map(in <-chan interface{},
out chan<- interface{})
}
type Partitioner interface {
Partition(in <-chan interface{},
outs []chan interface{},
wg *sync.WaitGroup)
}
type Reducer interface {
Reduce(in <-chan interface{},
out chan<- interface{},
wg *sync.WaitGroup)
}
24
Mapper
Partitioner
Reducer
GoMR - Interfaces
type Job interface {
Mapper
Partitioner
Reducer
}
25
GoMR - Wiring
inChansMap = make([]chan interface{}, numMappers)
inChansPartition := make([]chan interface{}, numMappers)
inChansReduce := make([]chan interface{}, numReducers)
outChanReduce = make(chan interface{}, CHANBUF)
26
Input
Mapper
Mapper
Reducer
Reducer
Output
Partitioner
Partitioner
1
1 2
2
3
4
3 4
Input
Mapper
Mapper
Reducer
Reducer
Output
Partitioner
Partitioner
GoMR – Channels & Goroutines
for i := 0; i < numReducers; i++ {
inChansReduce[i] = make(chan interface{}, CHANBUF)
go j.Reduce(inChansReduce[i], outChanReduce, &wgReduce)
}
for i := 0; i < numMappers; i++ {
inChansMap[i] = make(chan interface{}, CHANBUF)
inChansPartition[i] = make(chan interface{}, CHANBUF)
go j.Map(inChansMap[i], inChansPartition[i])
go j.Partition(inChansPartition[i], inChansReduce, &wgPartition)
}
31
GoMR - Synchronization
var wgPartition, wgReduce sync.WaitGroup
wgPartition.Add(numMappers)
wgReduce.Add(numReducers)
34
Input
Mapper
Mapper
Reducer
Reducer
Output
Partitioner
Partitioner
go func() {
wgPartition.Wait()
for i := 0; i < numReducers; i++ {
close(inChansReduce[i])
}
wgReduce.Wait()
close(outChanReduce)
}()
Evaluation
36
GoMR
Why Spark?
● Pyspark
● Simple install
● Uses all cores
37
Wordcount - PySpark
words = sc.textFile(sys.argv[1]) 
.flatMap(lambda line: line.split(" ")) 
.map(lambda word: (word, 1)) 
.reduceByKey(lambda a, b: a + b, int(parallelism))
38
Evaluation - Server
40
Evaluation - Errors
GoMR - 7
panic: interface conversion: interface {} is
main.Count, not string
goroutine 28 [running]:
main.(*WordCount).Partition(0x5e5740,
0xc000130540, 0xc000132100, 0x8, 0x8,
0xc00012a250)
/home/cnnrznn/go/src/github.com/cnnrznn/gomr
/examples/wordcount/wordcount.go:51 +0x23d
created by github.com/cnnrznn/gomr.RunLocal
/home/cnnrznn/go/src/github.com/cnnrznn/gomr
/gomr.go:148 +0x2d6
Spark - 256
org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
File
"/home/cnnrznn/spark/python/lib/pyspark.zip/pys
park/worker.py", line 377, in main
process()
File
"/home/cnnrznn/spark/python/lib/pyspark.zip/pys
park/worker.py", line 372, in process
serializer.dump_stream(func(split_index,
iterator), outfile)
File
"/home/cnnrznn/spark/python/lib/pyspark.zip/pys
park/rdd.py", line 2499, in pipeline_func
File
"/home/cnnrznn/spark/python/lib/pyspark.zip/pys
park/rdd.py", line 2499, in pipeline_func
File executor driver):
41
Evaluation - Configuration
wc := &WordCount{}
nprocs := runtime.NumCPU()
ins, out := gomr.RunLocal(nprocs, nprocs, wc)
42
GoMR on K8s
● Kubernetes
● I/O
● Jobs & Drivers
43
Kubernetes
● Containers
○ Pod == set of
containers
● Scheduling
● Networking
● Why spin my own
control plane?
44
Pod
Pod
Pod
Node
Pod
Pod
Node
Node
Pod
Pod
Node
I/O
● Containers are ephemeral
● PersistentVolumeClaim mounts a drive on
the container
● Input and output through mounted drive
45
Prerequisites
● kubectl with access to a cluster
● PersistentVolume and
PersistentVolumeClaim
● https://github.com/cnnrznn/wordcount-gomr
46
MR Application
47
Job Job
Input.1
Input.2
Input.3
Input.4
Output.1
Output.2
Output.3
Output.4
…
Jobs
type WordCount struct{}
func (w *WordCount) Map(in <-chan interface{}, out chan<- interface{}) {
// …
}
func (w *WordCount) Partition(in <-
chan interface{}, outs []chan interface{}, wg *sync.WaitGroup) {
// …
}
func (w *WordCount) Reduce(in <-chan interface{}, out chan<
interface{}, wg *sync.WaitGroup) {
// …
}
func main() {
gomr.RunDistributed(&WordCount{})
}
48
Container
Drivers
package main
import "github.com/cnnrznn/gomr/pkg/driver"
func main() {
d := driver.NewDriver("Wordcount App", 3)
d.Run("wordcount", "/data/moby.txt", "/data/wc.txt")
}
49
Kubernetes
50
Pod
Pod
Map
Node
Reduce
Node
Node
Map
Node
Map
Reduce
Reduce
Job
Driver
Existing solutions
● http://marcio.io/2015/07/cheap-
mapreduce-in-go/
● https://godoc.org/github.com/ahamidi/g
o-mapreduce
● https://medium.com/@jayhuang75/a-
simple-mapreduce-in-go-42c929b000c5
● https://appliedgo.net/mapreduce/
● https://blog.gopheracademy.com/advent
-2015/glow-map-reduce-for-golang/
● https://github.com/chrislusf/gleam
● https://github.com/darkjh/go-
mapreduce
52
One-off / Straight-line
Hard-coded types
Mature
Almost Mature
Questions?
● Problem
● MapReduce Background
● GoMR
● Evaluation
● GoMR on K8s
53
Connor Zanin
● Programming in Go ~1 year
● MSc from Northeastern
● https://connorzanin.com
● https://github.com/cnnrznn
● https://github.com/gomr
54
Hire Me
Design Questions
● Values or Channels?
● Separate driver and job?
● Using K8s job vs Deployment
● UUID for k8s jobs
55
Wordcount - GoMR
func (w *WordCount) Map(in <-chan interface{},
out chan<- interface{}) {
counts := make(map[string]int)
for elem := range in {
for _, word := range strings.Fields(elem.(string)) {
counts[word]++
}
}
for k, v := range counts {
out <- Count{k, v}
}
close(out)
} 56
Wordcount - GoMR
func (w *WordCount) Partition(in <-chan interface{},
outs []chan interface{}, wg *sync.WaitGroup) {
for elem := range in {
key := elem.(Count).Key
h := sha1.New()
h.Write([]byte(key))
hash := int(binary.BigEndian.Uint64(h.Sum(nil)))
if hash < 0 {
hash = hash * -1
}
outs[hash%len(outs)] <- elem
}
wg.Done()
}
57
Wordcount - GoMR
func (w *WordCount) Reduce(in <-chan interface{},
out chan<- interface{}, wg *sync.WaitGroup) {
counts := make(map[string]int)
for elem := range in {
ct := elem.(Count)
counts[ct.key] += ct.Value
}
for k, v := range counts {
out <- Count{k, v}
}
wg.Done()
}
58
Wordcount - GoMR
func main() {
wc := &WordCount{}
par := runtime.NumCPU()
ins, out := gomr.RunLocal(par, par, wc)
gomr.TextFileParallel(os.Args[1], ins)
for count := range out {
fmt.Println(count)
}
}
60

More Related Content

What's hot

Revisiting Visitors for Modular Extension of Executable DSMLs
Revisiting Visitors for Modular Extension of Executable DSMLs Revisiting Visitors for Modular Extension of Executable DSMLs
Revisiting Visitors for Modular Extension of Executable DSMLs
Manuel Leduc
 
Gnu debugger
Gnu debuggerGnu debugger
Gnu debugger
Gizem Çetin
 
Golang concurrency design
Golang concurrency designGolang concurrency design
Golang concurrency design
Hyejong
 
Multi qubit entanglement
Multi qubit entanglementMulti qubit entanglement
Multi qubit entanglement
Vijayananda Mohire
 
pg_filedump
pg_filedumppg_filedump
pg_filedump
Aleksander Alekseev
 
Git presentation bixlabs
Git presentation bixlabsGit presentation bixlabs
Git presentation bixlabs
Bixlabs
 
Low pause GC in HotSpot
Low pause GC in HotSpotLow pause GC in HotSpot
Low pause GC in HotSpot
jClarity
 
Tech Talk - Konrad Gawda : P4 programming language
Tech Talk - Konrad Gawda : P4 programming languageTech Talk - Konrad Gawda : P4 programming language
Tech Talk - Konrad Gawda : P4 programming language
CodiLime
 
Data recovery using pg_filedump
Data recovery using pg_filedumpData recovery using pg_filedump
Data recovery using pg_filedump
Aleksander Alekseev
 

What's hot (9)

Revisiting Visitors for Modular Extension of Executable DSMLs
Revisiting Visitors for Modular Extension of Executable DSMLs Revisiting Visitors for Modular Extension of Executable DSMLs
Revisiting Visitors for Modular Extension of Executable DSMLs
 
Gnu debugger
Gnu debuggerGnu debugger
Gnu debugger
 
Golang concurrency design
Golang concurrency designGolang concurrency design
Golang concurrency design
 
Multi qubit entanglement
Multi qubit entanglementMulti qubit entanglement
Multi qubit entanglement
 
pg_filedump
pg_filedumppg_filedump
pg_filedump
 
Git presentation bixlabs
Git presentation bixlabsGit presentation bixlabs
Git presentation bixlabs
 
Low pause GC in HotSpot
Low pause GC in HotSpotLow pause GC in HotSpot
Low pause GC in HotSpot
 
Tech Talk - Konrad Gawda : P4 programming language
Tech Talk - Konrad Gawda : P4 programming languageTech Talk - Konrad Gawda : P4 programming language
Tech Talk - Konrad Gawda : P4 programming language
 
Data recovery using pg_filedump
Data recovery using pg_filedumpData recovery using pg_filedump
Data recovery using pg_filedump
 

Similar to GoMR: A MapReduce Framework for Go

My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
Noha Elprince
 
MapReduce
MapReduceMapReduce
MapReduce
SatyaHadoop
 
Map reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICSMap reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICS
Archana Gopinath
 
MapReduce
MapReduceMapReduce
MapReduce
ahmedelmorsy89
 
MapReduce
MapReduceMapReduce
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
Gabriela Agustini
 
Scalding
ScaldingScalding
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
Romain Jacotin
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
Bhupesh Chawda
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
HARIKRISHNANU13
 
MongoDB Distilled
MongoDB DistilledMongoDB Distilled
MongoDB Distilled
b0ris_1
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
Harisankar H
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
MapReduce
MapReduceMapReduce
MapReduce
KavyaGo
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
MapReduce wordcount program
MapReduce wordcount program MapReduce wordcount program
MapReduce wordcount program
Sarwan Singh
 
Precog & MongoDB User Group: Skyrocket Your Analytics
Precog & MongoDB User Group: Skyrocket Your Analytics Precog & MongoDB User Group: Skyrocket Your Analytics
Precog & MongoDB User Group: Skyrocket Your Analytics
MongoDB
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
Mao Geng
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Sri Prasanna
 
MapReduce: Simplified Data Processing on Large Clusters
MapReduce: Simplified Data Processing on Large ClustersMapReduce: Simplified Data Processing on Large Clusters
MapReduce: Simplified Data Processing on Large Clusters
Ashraf Uddin
 

Similar to GoMR: A MapReduce Framework for Go (20)

My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
 
MapReduce
MapReduceMapReduce
MapReduce
 
Map reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICSMap reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICS
 
MapReduce
MapReduceMapReduce
MapReduce
 
MapReduce
MapReduceMapReduce
MapReduce
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Scalding
ScaldingScalding
Scalding
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
MongoDB Distilled
MongoDB DistilledMongoDB Distilled
MongoDB Distilled
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
MapReduce
MapReduceMapReduce
MapReduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
MapReduce wordcount program
MapReduce wordcount program MapReduce wordcount program
MapReduce wordcount program
 
Precog & MongoDB User Group: Skyrocket Your Analytics
Precog & MongoDB User Group: Skyrocket Your Analytics Precog & MongoDB User Group: Skyrocket Your Analytics
Precog & MongoDB User Group: Skyrocket Your Analytics
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
MapReduce: Simplified Data Processing on Large Clusters
MapReduce: Simplified Data Processing on Large ClustersMapReduce: Simplified Data Processing on Large Clusters
MapReduce: Simplified Data Processing on Large Clusters
 

Recently uploaded

APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
Gerardo Pardo-Castellote
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptxLORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
lorraineandreiamcidl
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
kalichargn70th171
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
Rakesh Kumar R
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
Hironori Washizaki
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
Philip Schwarz
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j
 

Recently uploaded (20)

APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptxLORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
 

GoMR: A MapReduce Framework for Go

  • 1. About Me - Connor Zanin ● Boston ● Programming in Go ~1 year ● Working with MR since 2014 ● MSc from Northeastern ● https://connorzanin.com ● https://github.com/cnnrznn ● https://github.com/gomr 1
  • 2. GoMR: A MapReduce Framework for Golang Connor Zanin
  • 3. Outline ● Problem ● MapReduce Background ● GoMR ● Evaluation ● GoMR on K8s Ask Questions! 3
  • 4. The Problem ● Class on large-scale data processing (MR) ● Implement triangle count on Hadoop MapReduce ○ 2 JOIN ops ○ Large intermediate tables 5
  • 5. The Problem ● Configurations ● Error messages ● Performance 8
  • 7. What is MapReduce? Jeff Dean, Sanjay Ghemawat @ Google MapReduce: Simplified data processing on large clusters ● “Hundreds of special-purpose computations that process large amounts of raw data” ● Simple computations ● Complex distribution + fault tolerance code 10
  • 8. What is MapReduce? “MapReduce is a programming model and implementation for processing … big data sets with a parallel, distributed algorithm on a cluster” - Wikipedia 11
  • 9. MapReduce - Definitions ● Map: Transform input elements to <Key, Value> pairs ● Reduce: Transform all Values per Key 12
  • 10. MapReduce – Word Count [cat, dog, cat] [(cat, 1), (dog, 1), (cat, 1)] x => (x, 1) Map Reduce 𝑘 𝑣 13 [(cat, 2), (dog, 1)]
  • 11. MapReduce - Pipeline 14 Input Map Reduce Output x => (x, 1) ∑ [cat, dog, cat] [(cat, 2), (dog, 1)]
  • 12. MapReduce - Parallel Pipeline Input Mapper Mapper Mapper Reducer Reducer Reducer Output 15
  • 13. MapReduce - Parallel Pipeline 16 Cat Cat Dog (Cat,1) (Dog,1) (Cat,1)
  • 14. MapReduce - Parallel Pipeline Input Mapper Mapper Mapper Reducer Reducer Reducer Output 17 ?
  • 15. MapReduce - Definitions ● Map: Transform input elements to <Key, Value> pairs ● Reduce: Transform all Values per Key ● Partition: Group <Key, Value> pairs by Key 18
  • 16. MapReduce - Parallel Pipeline 19 Cat Cat Dog (Cat, 1) (Dog, 1) (Cat, 1) (Cat, 2)
  • 17. GoMR ● Goals ● Design ● Implementation 21
  • 18. GoMR - Goal ● Minimal configuration ● Simple error messages ● Memory, CPU efficient 22 Let’s
  • 20. GoMR - Interfaces type Mapper interface { Map(in <-chan interface{}, out chan<- interface{}) } type Partitioner interface { Partition(in <-chan interface{}, outs []chan interface{}, wg *sync.WaitGroup) } type Reducer interface { Reduce(in <-chan interface{}, out chan<- interface{}, wg *sync.WaitGroup) } 24 Mapper Partitioner Reducer
  • 21. GoMR - Interfaces type Job interface { Mapper Partitioner Reducer } 25
  • 22. GoMR - Wiring inChansMap = make([]chan interface{}, numMappers) inChansPartition := make([]chan interface{}, numMappers) inChansReduce := make([]chan interface{}, numReducers) outChanReduce = make(chan interface{}, CHANBUF) 26 Input Mapper Mapper Reducer Reducer Output Partitioner Partitioner 1 1 2 2 3 4 3 4
  • 23. Input Mapper Mapper Reducer Reducer Output Partitioner Partitioner GoMR – Channels & Goroutines for i := 0; i < numReducers; i++ { inChansReduce[i] = make(chan interface{}, CHANBUF) go j.Reduce(inChansReduce[i], outChanReduce, &wgReduce) } for i := 0; i < numMappers; i++ { inChansMap[i] = make(chan interface{}, CHANBUF) inChansPartition[i] = make(chan interface{}, CHANBUF) go j.Map(inChansMap[i], inChansPartition[i]) go j.Partition(inChansPartition[i], inChansReduce, &wgPartition) } 31
  • 24. GoMR - Synchronization var wgPartition, wgReduce sync.WaitGroup wgPartition.Add(numMappers) wgReduce.Add(numReducers) 34 Input Mapper Mapper Reducer Reducer Output Partitioner Partitioner go func() { wgPartition.Wait() for i := 0; i < numReducers; i++ { close(inChansReduce[i]) } wgReduce.Wait() close(outChanReduce) }()
  • 26. Why Spark? ● Pyspark ● Simple install ● Uses all cores 37
  • 27. Wordcount - PySpark words = sc.textFile(sys.argv[1]) .flatMap(lambda line: line.split(" ")) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a + b, int(parallelism)) 38
  • 29. Evaluation - Errors GoMR - 7 panic: interface conversion: interface {} is main.Count, not string goroutine 28 [running]: main.(*WordCount).Partition(0x5e5740, 0xc000130540, 0xc000132100, 0x8, 0x8, 0xc00012a250) /home/cnnrznn/go/src/github.com/cnnrznn/gomr /examples/wordcount/wordcount.go:51 +0x23d created by github.com/cnnrznn/gomr.RunLocal /home/cnnrznn/go/src/github.com/cnnrznn/gomr /gomr.go:148 +0x2d6 Spark - 256 org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/cnnrznn/spark/python/lib/pyspark.zip/pys park/worker.py", line 377, in main process() File "/home/cnnrznn/spark/python/lib/pyspark.zip/pys park/worker.py", line 372, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/cnnrznn/spark/python/lib/pyspark.zip/pys park/rdd.py", line 2499, in pipeline_func File "/home/cnnrznn/spark/python/lib/pyspark.zip/pys park/rdd.py", line 2499, in pipeline_func File executor driver): 41
  • 30. Evaluation - Configuration wc := &WordCount{} nprocs := runtime.NumCPU() ins, out := gomr.RunLocal(nprocs, nprocs, wc) 42
  • 31. GoMR on K8s ● Kubernetes ● I/O ● Jobs & Drivers 43
  • 32. Kubernetes ● Containers ○ Pod == set of containers ● Scheduling ● Networking ● Why spin my own control plane? 44 Pod Pod Pod Node Pod Pod Node Node Pod Pod Node
  • 33. I/O ● Containers are ephemeral ● PersistentVolumeClaim mounts a drive on the container ● Input and output through mounted drive 45
  • 34. Prerequisites ● kubectl with access to a cluster ● PersistentVolume and PersistentVolumeClaim ● https://github.com/cnnrznn/wordcount-gomr 46
  • 36. Jobs type WordCount struct{} func (w *WordCount) Map(in <-chan interface{}, out chan<- interface{}) { // … } func (w *WordCount) Partition(in <- chan interface{}, outs []chan interface{}, wg *sync.WaitGroup) { // … } func (w *WordCount) Reduce(in <-chan interface{}, out chan< interface{}, wg *sync.WaitGroup) { // … } func main() { gomr.RunDistributed(&WordCount{}) } 48 Container
  • 37. Drivers package main import "github.com/cnnrznn/gomr/pkg/driver" func main() { d := driver.NewDriver("Wordcount App", 3) d.Run("wordcount", "/data/moby.txt", "/data/wc.txt") } 49
  • 39. Existing solutions ● http://marcio.io/2015/07/cheap- mapreduce-in-go/ ● https://godoc.org/github.com/ahamidi/g o-mapreduce ● https://medium.com/@jayhuang75/a- simple-mapreduce-in-go-42c929b000c5 ● https://appliedgo.net/mapreduce/ ● https://blog.gopheracademy.com/advent -2015/glow-map-reduce-for-golang/ ● https://github.com/chrislusf/gleam ● https://github.com/darkjh/go- mapreduce 52 One-off / Straight-line Hard-coded types Mature Almost Mature
  • 40. Questions? ● Problem ● MapReduce Background ● GoMR ● Evaluation ● GoMR on K8s 53
  • 41. Connor Zanin ● Programming in Go ~1 year ● MSc from Northeastern ● https://connorzanin.com ● https://github.com/cnnrznn ● https://github.com/gomr 54 Hire Me
  • 42. Design Questions ● Values or Channels? ● Separate driver and job? ● Using K8s job vs Deployment ● UUID for k8s jobs 55
  • 43. Wordcount - GoMR func (w *WordCount) Map(in <-chan interface{}, out chan<- interface{}) { counts := make(map[string]int) for elem := range in { for _, word := range strings.Fields(elem.(string)) { counts[word]++ } } for k, v := range counts { out <- Count{k, v} } close(out) } 56
  • 44. Wordcount - GoMR func (w *WordCount) Partition(in <-chan interface{}, outs []chan interface{}, wg *sync.WaitGroup) { for elem := range in { key := elem.(Count).Key h := sha1.New() h.Write([]byte(key)) hash := int(binary.BigEndian.Uint64(h.Sum(nil))) if hash < 0 { hash = hash * -1 } outs[hash%len(outs)] <- elem } wg.Done() } 57
  • 45. Wordcount - GoMR func (w *WordCount) Reduce(in <-chan interface{}, out chan<- interface{}, wg *sync.WaitGroup) { counts := make(map[string]int) for elem := range in { ct := elem.(Count) counts[ct.key] += ct.Value } for k, v := range counts { out <- Count{k, v} } wg.Done() } 58
  • 46. Wordcount - GoMR func main() { wc := &WordCount{} par := runtime.NumCPU() ins, out := gomr.RunLocal(par, par, wc) gomr.TextFileParallel(os.Args[1], ins) for count := range out { fmt.Println(count) } } 60

Editor's Notes

  1. Hi everyone, my name is Connor and tonight I’m going to talk about a package I wrote using Go. The package is called GoMR, and it is a fast, efficient implementation paradigm used for processing big-data.
  2. Tracing my program to optimize Closing on multiple-sender channels Channels as an API
  3. TODO investigate importance of triangle count
  4. Type equation here.
  5. I was frustrated I knew what I wanted to do Painful to do in Hadoop – wasn’t having fun Can I use go for this?
  6. Mapreduce comes from functional programming 2 functions Map -> function over items in collection Reduce -> derive single value from collection
  7. Map Reduce comes from functional programming conceptually
  8. We can parallelize the map-reduce model Need an additional step, shuffle, for useful programs
  9. We can parallelize the map-reduce model Need an additional step, shuffle, for useful programs
  10. We can parallelize the map-reduce model Need an additional step, shuffle, for useful programs
  11. We can parallelize the map-reduce model Need an additional step, shuffle, for useful programs
  12. As an example, consider conference registration First, attendants register online This is the map Then, the attendants arrive and are sorted alphabetically This is the partition Finally, each registrar may keep metrics about how many of their assigned attendees actually show up
  13. Implementation for single machine 38 lines
  14. Mapper, partitioner, reducer === user logic Library defines interface User implements Map(), Partition(), Reduce() Library allocates goroutines, channels Library handles synchronization
  15. Each stage needs to know when the previous is done Input and map => close(ch) Partitioner, reducer => wg.Done()
  16. Each stage needs to know when the previous is done Input and map => close(ch) Partitioner, reducer => wg.Done()
  17. Why compare to spark? Easier to setup for local execution Uses JVM still
  18. TODO fix this!!!
  19. TODO fix this!!!
  20. Code compiled to container
  21. Driver talks to Kubernetes Re-run if the jobs fail