COLLEGE OF COMPUTING, GEORGIA INSTITUTE OF TECHNOLOGY
Workshop 8/Systems Workshop 3:
Worker Task Execution
In this module of the class, you are going to implement the required code to
execute the map and reduce tasks on the worker. Use the code created in the
previous workshop as a base for the implementation.
1 EXPECTED OUTCOME
The student is going to:
• Deploy MapReduce applications running in Kubernetes to Azure
• Develop an HTTP user interface for submitting input files and mapper and reducer
functions to our system.
• Create working mapper and reducer functions that execute user-submitted Python code
inside the worker nodes.
• Design and implement the interfaces and functionalities for the execution of the map
and reduce phases in the workers.
2 ASSUMPTIONS
This workshop assumes that the student has successfully completed all the previous
workshops in this module, and that the assumptions of those workshops still hold.
3 DOWNLOAD RELEVANT ENVIRONMENT TOOLS
Install the Azure CLI.
4 SPECIFICATION
Your MapReduce implementation should be able to:
• Deploy your MapReduce cluster to Azure Kubernetes Service (AKS)
• Execute the map or reduce phase on any worker
• After executing the map phase, sort the mapper result in place and store it in the corre-
sponding location.
• Store the required information in the master to be able to fetch the required <key,value>
pairs to execute the Reduce phase in the corresponding worker.
• Store the final results in Azure Blob Storage. You should be able to use this data as an
input for a pipelined map/reduce computation.
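Sorted mapper output lets a reduce worker merge the pieces it fetches and see all values for a
key together. The following is a minimal sketch of that idea in Python, not a required design.
It assumes each fetched piece is an in-memory list of (key, value) tuples already sorted by
key; the function name group_for_reduce is hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def group_for_reduce(sorted_runs):
    """Merge key-sorted runs and yield (key, [values]) once per distinct key.

    sorted_runs: iterable of lists of (key, value) tuples, each list sorted by key
    (an assumption of this sketch).
    """
    # heapq.merge streams the runs in key order without concatenating and re-sorting.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    # groupby only groups adjacent items, which is why the merged stream must be sorted.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]
```

Each yielded key and value list can then be handed to the reducer in a single call.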
5 IMPLEMENTATION
5.1 DEPLOYING RESOURCES
You need to deploy your Kubernetes cluster (which has been running locally until now) to
Azure Kubernetes Service. Create an AKS instance and use kubectl to deploy to it. Please
consult the AKS Docs.
Right now your container images are only available locally. Push your images to Azure using
Azure Container Registry, and configure your Kubernetes deployment to use the correct
images. Please consult the ACR Docs.
You can configure access to both your local cluster running in KIND and to your Azure
deployment via the kubeconfig. Please use these docs to learn more.
5.1.1 USEFUL LINKS
• Kubernetes Walkthrough
• Configure Kubernetes KIND Cluster in Azure
5.2 USER INTERFACE FOR MAPREDUCE
Creating a good user interface for software is an important aspect of developing a successful
product. In this section, you are going to develop an HTTP interface to our MapReduce service.
This interface should be on the master node and it should allow you to POST a job to
our service. It is up to you to specify how this interface works. Think about what kind of
information you are going to need to POST and how you might use HTTP to transfer that
information.
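As one way to think about what a job submission might carry, here is a small client-side
sketch in Python. Everything about the payload is an assumption made for illustration: the
/jobs path, the JSON encoding, and the field names (input_path, mapper, reducer, num_reducers)
are hypothetical, and your interface may look quite different.

```python
import json
import urllib.request


def build_job_request(master_url, input_path, mapper_source, reducer_source, num_reducers):
    """Build (but do not send) an HTTP POST request describing one MapReduce job.

    All field names below are hypothetical; they are not defined by the workshop.
    """
    body = json.dumps({
        "input_path": input_path,      # where the input files live
        "mapper": mapper_source,       # Python source of the map function
        "reducer": reducer_source,     # Python source of the reduce function
        "num_reducers": num_reducers,  # R, the number of reduce outputs
    }).encode("utf-8")
    return urllib.request.Request(
        master_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it to the master.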
We recommend checking out the C++ netlib library, which provides HTTP support, to accomplish
this goal.
If you did not set up Kubernetes readiness probes in the first week, we recommend that you
do so now. It will be a simple addition to this interface.
5.2.1 USEFUL LINKS
• C++ netlib library
5.3 PYTHON CODE WRAPPER
The map and reduce functions are going to be implemented in Python. The Python script
receives each input value through standard input and writes the key-value pairs to
standard output. Your worker functions need to be able to feed the inputs as stdin to the
Python scripts, start the execution of the code, and capture the output of the Python script.
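To make the stdin/stdout contract concrete, here is a minimal word-count mapper sketch. The
tab-separated "key<TAB>value" line format and the function names map_line and run_mapper are
assumptions of this example; the workshop does not prescribe a format.

```python
import io


def map_line(line):
    """Yield (word, 1) for every whitespace-separated word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def run_mapper(stdin, stdout):
    """Read input lines from stdin and write one 'key<TAB>value' line per pair."""
    for line in stdin:
        for key, value in map_line(line):
            stdout.write(f"{key}\t{value}\n")


# Demonstration with in-memory streams. A real script would instead call
# run_mapper(sys.stdin, sys.stdout) so the worker can drive it through pipes.
demo_output = io.StringIO()
run_mapper(io.StringIO("The cat the\n"), demo_output)
print(demo_output.getvalue(), end="")
```

A reducer can follow the same pattern, reading sorted "key<TAB>value" lines and writing one
line per key.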
5.3.1 OPTION A
The map and reduce components are going to be implemented in Python, similar to the first
workshop of this course. The Python script receives each input value through standard
input and writes the key-value pairs to standard output. Your code needs to feed the
inputs to the Python script, start the execution of the program, and save the results from
the script's output. To accomplish this task you are going to use four functions: pipe, fork,
dup2, and execl. Using these functions you are going to implement a bidirectional pipe to
communicate with the Python code.
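The sequence of calls can be sketched as follows. Python's os module exposes the same POSIX
calls (os.pipe, os.fork, os.dup2, os.execl), so this sketch mirrors what the C or C++ worker
would do; it is an illustration under stated assumptions, not the required implementation.
The function name run_python is hypothetical, and the write-everything-then-read order is
only safe for small inputs and outputs.

```python
import os
import sys


def run_python(script_args, input_bytes, python=sys.executable):
    """Run `python script_args...`, feed input_bytes on its stdin, return (stdout, exit_code)."""
    to_child_r, to_child_w = os.pipe()      # parent writes, child reads as stdin
    from_child_r, from_child_w = os.pipe()  # child writes as stdout, parent reads

    pid = os.fork()
    if pid == 0:
        # Child: wire the pipe ends onto fd 0 and fd 1, then replace the process image.
        try:
            os.dup2(to_child_r, 0)
            os.dup2(from_child_w, 1)
            for fd in (to_child_r, to_child_w, from_child_r, from_child_w):
                os.close(fd)
            os.execl(python, python, *script_args)
        finally:
            os._exit(127)  # reached only if the setup or exec failed

    # Parent: close the ends that belong to the child, otherwise EOF never arrives.
    os.close(to_child_r)
    os.close(from_child_w)

    # Caution: if the child fills its output pipe while we are still writing,
    # both sides block forever. Larger data needs interleaved reads and writes.
    os.write(to_child_w, input_bytes)
    os.close(to_child_w)  # signals EOF on the child's stdin

    chunks = []
    while True:
        chunk = os.read(from_child_r, 4096)
        if not chunk:  # empty read means the child closed its stdout
            break
        chunks.append(chunk)
    os.close(from_child_r)

    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return b"".join(chunks), exit_code
```

Forgetting to close an unused pipe end in either process is a common cause of a reader that
never sees end-of-file.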
5.3.2 SUGGESTIONS
• Discuss with the other students the corner and error cases that can arise when
using the four suggested functions. How do we avoid deadlock scenarios, and how do
we handle them when they occur?
5.3.3 OPTION B
Another possibility for implementing the Python function call is described in Extending
Python with C: we use the functions declared in Python.h to call the Python function
directly from C++.
5.3.4 SUGGESTIONS
• Discuss with the other students the benefits and drawbacks of option A and option B,
and potentially suggest other options. You are free to choose a different way to run
the mapper and reducer, but be sure to analyze the pros and cons of your solution.
5.3.5 USEFUL LINKS
• Calling Python Functions from C
• execl(3) - Linux man page
• pipe(2) - Linux man page
• fork(2) - Linux man page
• Piping for input/output
• Creating pipes in C
• Popen
5.4 SAVE INTERMEDIATE RESULT
Using the API created in the previous workshop, save the output created by the map phase into
the intermediate storage. There should be R outputs created. The structure of this
intermediate storage is going to depend on the specification file presented as a deliverable
for the previous workshop.
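One common way to produce R outputs is to assign each key to a partition and sort every
partition in place before storing it. The sketch below assumes hash partitioning with CRC32
modulo R, which follows the default hash(key) mod R scheme described in the MapReduce paper;
the function names are hypothetical, and writing the partitions to storage is left to your
own API.

```python
import zlib


def partition_for(key, num_reducers):
    """Return a stable partition index in [0, num_reducers) for a string key.

    zlib.crc32 is used instead of the built-in hash(), whose value for strings
    changes between Python processes.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_and_sort(pairs, num_reducers):
    """Split (key, value) pairs into num_reducers lists, each sorted by key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    for partition in partitions:
        partition.sort(key=lambda pair: pair[0])  # in-place, stable sort by key
    return partitions
```

Because the partition function is deterministic, every mapper sends a given key to the same
reduce partition.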
5.5 SUGGESTIONS
• Discuss with other students ways to store the intermediate results. Should they be in
blob storage or in the local storage of the workers? Should there be M*R output files
in total, or only R output files written using atomic append operations? If you are
using local files, are you copying them with Linux commands like scp, or are you
using RPC connections to the workers?
5.6 SAVE FINAL RESULT
Using the API created in the previous workshop, save the output generated by the reduce
phase into the final location.
5.6.1 SUGGESTIONS
• Your framework should be able to use the final result as an input to a pipeline of
MapReduce executions.
6 DELIVERABLES
• The Git repo that contains all the required code, and the commit ID.
• A demo that shows:
– Deploy your system to Azure Kubernetes.
– Configure your kubectl cli to point to the Azure cluster.
– Demonstrate your ability to scale your worker and master nodes via the kubectl cli.
– Submit a job via an HTTP request to your cluster.
– Show the output of your MapReduce job.
7 USEFUL REFERENCES
• MapReduce paper