Week 7 of the Data Scientist Enablement program focuses on advanced topics including MapReduce, Lambda Architecture, and Google BigQuery. Participants are instructed to continue tutorials on Hortonworks and explore Google public datasets. The assignment involves running queries on a baseball statistics dataset using Hadoop, R, or BigQuery to compute maximum and average runs by year and identify the top players. Participants can earn a proficiency certificate based on their overall performance and mastery of concepts across the four-module program.
1. Data Scientist Enablement
DSE 400 - Fast Track to Data Science
Week 7 Roadmap
Advanced Center of Excellence
Modern Renaissance Corporation
In Collaboration with SONO team and others
Content of this document is under Creative Commons Licence CC BY 4.0
2. Agenda
You can always find the latest version of this document at http://bit.ly/1fyOSnN
Week 7 Overview
Discussions
Learning Path
Activities
Assignment
Submission
Adaptive Learning
References
Citation
“Action is the foundational key to all success” - Pablo Picasso
3. Social Discourse:
Discuss IBM Watson. Continue building R-COP and Modern Data Platforms-COP
Learning plan:
Read about MapReduce, Lambda Architecture, Google BigQuery
Activities:
Continue Hortonworks Tutorials. Explore Google Public Datasets and BigQuery
Assignment 7:
Perform queries on Baseball Statistics dataset
DSE 400 - Week 7 at a glance
4. Discussion: Watch Ken Jennings: Watson, Jeopardy and me, the obsolete know-it-all and share
your thoughts/reflections on the evolving domain of “Cognitive Computing”
In line with our Open Innovation model, we are expanding our Social Discourse mode to
LinkedIn, Facebook and Google+. Discussions on SONO will continue as planned on
DSE 400 Jump Pad. This will give participants more choice. We hope this will
result in increased social engagement.
Check out Language R and Modern Data Platforms Communities of Practice (COPs) to
help you increase your competence in R, Machine Learning, Hadoop ecosystem and
other platforms. Reach out to Olivia Ramirez, Ellen Brock or Manju Rupani if you want
to contribute to these communities.
Social Engagement - Week 7
SONO Linkedin Facebook Google+
5. Read Practical illustration of Map-Reduce (Hadoop-style), on real data by Dr. Vincent Granville
Read Lambda Architecture for Big Data Systems by Michael Walker
Read Google BigQuery Tutorial
<Optional> Watch Hadoop - The Data Scientist's Dream
<Optional> Watch Hadoop MapReduce Example - How good are a city's farmer's markets by Helen Zeng
<Optional> Watch Google BigQuery in Ten Minutes
Recommended Learning Plan
6. Activities
<Practice> Check out Visualization of the Day at Data Science Central. As the name suggests, it is
different every day. Explore alternative ways of representing the data. Could you have
presented it in a better way?
<Practice> Visit Google Public Data Directory. Explore Greenhouse Gas Emissions by country.
How does your country fare per capita compared to the leading contributors? Also check out the IMF
World Outlook dataset. Visualize the data on Unemployment rate (this can be found under the People
category).
<Practice> Continue Hortonworks Tutorials on HDP 2.0. We will return to Hadoop and its
ecosystem in DSE 502, which will focus on Modern Data Platforms. In the meantime you can also
participate in Modern Data Platforms-Community of Practice and contribute to discussions on this
subject.
7. Assignment 7 - Submission Required
HDP 2.0 R-SQLDF BigQuery
Download Sean Lahman’s baseball statistics dataset. Using HDP 2.0 (or an
equivalent Hadoop platform), R-sqldf, or Google BigQuery, compute the following:
a) Group the data in the Batting table to show the maximum runs for each year.
b) Similarly, group the data in the Batting table to show the average runs for each year.
c) Display the maximum runs for each year and the associated player (last_name and
first_name) by joining the Batting and Master tables.
You may reach out to Rachel Fleming <rachel@emodern.biz> if you have any difficulties
with the assignments or are looking for more challenging assignments or activities.
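As a starting point, the three queries can be sketched in SQL and tried out locally before running them on the full dataset. The example below is a minimal sketch, not a reference solution: it uses Python's built-in sqlite3 module with a tiny made-up sample, and the column names (playerID, yearID, R, nameFirst, nameLast) are assumptions that should be checked against the files you actually download.

```python
import sqlite3

# Tiny made-up sample standing in for the real Batting and Master tables.
# Column names are assumptions; verify them against the downloaded files.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Batting (playerID TEXT, yearID INTEGER, R INTEGER);
    CREATE TABLE Master  (playerID TEXT, nameFirst TEXT, nameLast TEXT);
    INSERT INTO Batting VALUES
        ('p1', 2001, 100), ('p2', 2001, 80),
        ('p1', 2002, 90),  ('p3', 2002, 120);
    INSERT INTO Master VALUES
        ('p1', 'Ann', 'Alpha'), ('p2', 'Bob', 'Beta'), ('p3', 'Cy', 'Gamma');
""")

# a) Maximum runs for each year.
max_runs = conn.execute(
    "SELECT yearID, MAX(R) FROM Batting GROUP BY yearID ORDER BY yearID"
).fetchall()

# b) Average runs for each year.
avg_runs = conn.execute(
    "SELECT yearID, AVG(R) FROM Batting GROUP BY yearID ORDER BY yearID"
).fetchall()

# c) Maximum runs for each year with the player's name.
# The subquery finds each year's maximum; joining it back to Batting
# recovers the row(s) that reached it, and Master supplies the names.
top_players = conn.execute("""
    SELECT b.yearID, b.R, m.nameLast, m.nameFirst
    FROM Batting AS b
    JOIN (SELECT yearID, MAX(R) AS max_r
          FROM Batting GROUP BY yearID) AS t
      ON b.yearID = t.yearID AND b.R = t.max_r
    JOIN Master AS m ON m.playerID = b.playerID
    ORDER BY b.yearID
""").fetchall()

print(max_runs)     # [(2001, 100), (2002, 120)]
print(avg_runs)     # [(2001, 90.0), (2002, 105.0)]
print(top_players)  # [(2001, 100, 'Alpha', 'Ann'), (2002, 120, 'Gamma', 'Cy')]
```

The same SELECT statements should carry over to R-sqldf with little change, and to Hive or BigQuery after adjusting table references to each platform's syntax. Note that query c) returns every player who ties for a year's maximum, and that if the real table stores several rows per player per year you may want to sum runs per player and year first.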
8. Submission in PDF format is required
Recommended Deadline: Saturday, 11:59 PM your local time. If you can’t submit
your assignment on time, please complete it and turn it in ASAP. While there is
no penalty for late submission, turning in assignments on time will help you
focus on next week’s lessons.
Mail Assignment 7 to <dse400.datascience@gmail.com> with DSE 400 >
Assignment 7 in the subject line. Submit a single PDF document showing your
queries and result samples. Include screenshots as necessary. For the sake of
consistency, name your document DSE 400 - Assignment 7 - Your Full Name.
Do not send document links. Only a single document in PDF format
is accepted.
9. Adaptive Learning Options
Data Scientist Enablement program
Maturity   Composite Score *   Proficiency                Certificate
Level 5    > 90                Innovating Capability      Black Belt
Level 4    > 80 and <= 90      Architectural Capability   Green Belt
Level 3    > 70 and <= 80      Solutioning Capability     Yellow Belt
Level 2    > 60 and <= 70      Basic Understanding        Completion
Level 1    <= 60               Basic Familiarity          Audit
* The composite score takes into account participants' performance in assignments, activities, projects, social
engagement, collaboration, team development, publications, advanced research, etc. across all 4 modules of the DSE program
10. References, Resources and Additional Reading
17 short tutorials all Data Scientists should read (and practice). Dr. Granville. Data Science Central
Hadoop Illuminated. Kerzner and Maniyam. Hadoop Illuminated LLC 2013
Hadoop: The Definitive Guide. 3rd Edition. Tom White. O’Reilly Media. 2012
Programming Hive. Capriolo et al. O’Reilly Media. 2012
MapReduce: Simplified Data Processing on Large Clusters. Dean and Ghemawat. Google 2004
[MIT OCW] How to Process, Analyze and Visualize Data. Marcus & Wu. 2012
The Modern Data Architecture for Predictive Analytics
Big Data - Hadoop, Hive, Pig and Hbase video collection
Language R-Community of Practice
Modern Data Platforms-Community of Practice
Data Science Enablement playlist
11. Citation
Original content in this document is under Creative Commons
License CC BY 4.0. This license does not necessarily apply to other material
referenced in this document.
The baseball dataset used in this week’s activities and assignment is attributed to
Sean Lahman. This dataset is adapted under Creative Commons Licence 3.0.
Content from IBM, Hortonworks, Google, Data Science Central, O’Reilly
Media, etc. is excluded from the above Creative Commons License.
12. For More Information
Week 7 discussions take place during this week on DSE 400 forums on LinkedIn, Facebook, Google+
and SONO. There is also an active Q&A session for everyone's benefit. Also check out Language R-
Community of Practice if you would like to advance your competence in R or if you would like to
contribute to this community.
<Mentoring On Demand> You may reach out to Rachel Fleming <rachel@emodern.biz> if you have
any difficulties with the assignments or are looking for more challenging activities. If you need a mentor or
someone to help you accelerate along the DSE program, you may reach out to
Vishal Kumar <wishall.kumar@gmail.com> or Ligia Buzan <ligia.buzan@gmail.com>
We welcome questions, thoughts and suggestions. Post these in the right forums/discussions or write
to us at <dse400.datascience@gmail.com>
You can always find the latest version of this document and other DSE 400 roadmaps at
http://bitly.com/bundles/o_4ldaljhta4/1
13. Thank You
The Analytical Engine has no pretensions whatever to
originate anything. It can do whatever we know how to
order it to perform. It can follow analysis, but it has no
power of anticipating any analytical revelations or truths.
Its province is to assist us in making available what we
are already acquainted with. - Ada Lovelace