Introduction to Data Science

How it Works?
- LIVE Online Class
- Class Recording in LMS
- 24/7 Post-Class Support
- Module-Wise Quiz
- Project Work on a Large Database
- Verifiable Certificate
www.edureka.in/data-science

Topics for the Day
- Big Data
- Big Data Scenarios
- Big Data Challenges
- Introduction to Data Science
- Data Science: Components
- Types of Data Scientists
- Data Science: Core Components
- Use-Cases
- Introduction to Hadoop and R
- R and Hadoop Integration
- Machine Learning with Mahout
- References

Objectives
At the end of this module, you will be able to:
- Understand Big Data and its challenges
- Implement Big Data in real-time scenarios
- List and explain the components and prospects of Data Science
- Learn the implementation of Hadoop on Big Data
- Analyze some real-world use-cases with the help of the R programming language
- Understand machine learning concepts

Data Science

Big Data

What is Big Data?
Lots of data (terabytes or petabytes).
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
Source: http://www.today.mccombs.utexas.edu/2012/04/the-big-data-machine

Big Data Scenarios
Source: http://www.clker.com/clipart-13967.html

Big Data Scenarios: Sports
Sports teams are using data to track ticket sales and even team strategies.
Advertising and marketing agencies are tracking social media to understand responsiveness to campaigns, promotions, and other advertising media.
Source: http://www.espncricinfo.com/

Big Data Scenarios: Hospital Care
Hospitals are analyzing medical data and patient records to predict which patients are likely to seek readmission within a few months of discharge. The hospital can then intervene in hopes of preventing another costly hospital stay.
A medical diagnostics company analyzed millions of lines of data to develop the first non-intrusive test for predicting coronary artery disease. To do so, researchers at the company analyzed over 100 million gene samples to ultimately identify the 23 primary predictive genes for coronary artery disease.
Source: http://www.majorprojects.vic.gov.au/our-projects/our-past-projects/austin-hospital

Big Data Scenarios: Amazon.com
Amazon has an unrivalled bank of data on online consumer purchasing behaviour that it can mine from its 152 million customer accounts.
Amazon also uses Big Data to monitor, track and secure the 1.5 billion items in its retail store, which are spread across its 200 fulfilment centres around the world. Amazon stores the product catalogue data in S3.
S3 can write, read and delete objects of up to 5 TB each.
The catalogue stored in S3 receives more than 50 million updates a week, and every 30 minutes all data received is crunched and reported back to the different warehouses and the website.
Source: http://wp.streetwise.co/wp-content/uploads/2012/08/Amazon-Recommendations.png

Big Data Scenarios: Netflix
Netflix uses 1 petabyte to store the videos for streaming.
BitTorrent Sync has transferred over 30 petabytes of data since its pre-alpha release in January 2013.
The 2009 movie Avatar is reported to have taken over 1 petabyte of local storage at Weta Digital for the rendering of the 3D CGI effects.
One petabyte of average MP3-encoded songs (for mobile, roughly one megabyte per minute) would take about 2000 years to play.
Source: http://smhttp.23575.nexcesscdn.net/80ABE1/sbmedia/blog/wp-content/uploads/2013/03/netflix-in-asia.png
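The MP3 playback figure can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it assumes a year of 365.25 days and the slide's rate of one megabyte per minute, once in decimal units and once in binary units.

```python
# Assumption: a year is 365.25 days; the rate is the slide's 1 MB per minute.
MINUTES_PER_YEAR = 60 * 24 * 365.25


def playback_years(total_bytes, bytes_per_minute):
    """Years needed to play total_bytes of audio at the given rate."""
    return total_bytes / bytes_per_minute / MINUTES_PER_YEAR


# Decimal units: 1 PB = 10**15 bytes, 1 MB = 10**6 bytes.
decimal_years = playback_years(10**15, 10**6)
# Binary units: 1 PiB = 2**50 bytes, 1 MiB = 2**20 bytes.
binary_years = playback_years(2**50, 2**20)

print(round(decimal_years), round(binary_years))
```

Under these assumptions both results land between roughly 1900 and 2050 years, which is consistent with the rounded "2000 years" quoted on the slide.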

Big Data Scenarios: The Large Hadron Collider
The experiments in the Large Hadron Collider produce about 15 petabytes of data per year, which are distributed over the Worldwide LHC Computing Grid.
One petabyte is enough to store the DNA of the entire population of the USA, and then clone it twice.
Source: http://www.crowdsourcing.org/article/-nasa-tries-to-free-creativity-with-big-data-challenge/19984
Source: http://en.wikipedia.org/wiki/Large_Hadron_Collider

IBM’s Definition – Big Data Characteristics
VOLUME, VELOCITY, VARIETY
Example data sources: web logs, audio, images, videos, sensor data
Source: http://www-01.ibm.com/software/data/bigdata/

IBM’s Definition – Big Data Characteristics
The 3 Vs of Big Data:
- Volume: terabytes; records; transactions; tables, files
- Velocity: batch; near time; real time; streams
- Variety: structured; unstructured; semi-structured; all of the above
Source: http://www-01.ibm.com/software/data/bigdata/

What about ‘Veracity’?
Source: http://whatsthebigdata.files.wordpress.com/2013/11/batman-on-big-data.jpg

Annie’s Introduction
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.

Annie’s Question
Map each of the following to its corresponding type: structured, unstructured, or semi-structured.
- XML files
- Word docs, PDF files, text files
- E-mail body
- Data from enterprise systems (ERP, CRM, etc.)

Annie’s Answer
- XML files -> semi-structured data
- Word docs, PDF files, text files -> unstructured data
- E-mail body -> unstructured data
- Data from enterprise systems (ERP, CRM, etc.) -> structured data

Big Data: Challenges
Source: http://spinnakr.com/blog/wp-content/uploads/2013/08/Using-Big-Data-.jpg

Big Data: Challenges
- Data security and privacy
- High variety of information
- High veracity of data
- Data acquisition
- High velocity of processed data
- Information search and analytics
- High volume of data
- Information storage and analytics

Source: http://thesocietypages.org/sociologylens/files/2013/09/BIgDataDilbert_Cartoon.jpg

Data Science
Source: http://escience.washington.edu/blog/uw-berkeley-nyu-collaborate-378m-data-science-initiative

Data Science
“More data usually beats better algorithms.”
For example: recommending movies or music based on past preferences.

No matter how clever your algorithm is, it can often be beaten simply by having more data (and a less sophisticated algorithm).
Good News: Big Data is here.
Bad News: We are struggling to store and analyze it.

Data Science: Components
Source: http://abstrusegoose.com/55

Data Science: Components
- Visualization
- Advanced Computing
- Domain Expertise
- Statistics
- Data Engineering

Data Science: Prospects

Types of Data Scientists
Based on clustering the ways that Data Scientists handle data, the following four categories can be created:
- Data Businesspeople are the product- and profit-focused data scientists. They’re leaders, managers, and entrepreneurs, but with a technical bent. A common educational path is an engineering degree paired with an MBA.
- Data Creatives are eclectic jacks-of-all-trades, able to work with a broad range of data and tools. They may think of themselves as artists or hackers, and excel at visualization and open source technologies.
- Data Developers are focused on writing software to do analytic, statistical, and machine learning tasks, often in production environments. They often have computer science degrees, and often work with so-called “big data”.
- Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have PhDs, and their creative applications of mathematical tools yield valuable insights and products.
Source: http://datacommunitydc.org/blog/2013/06/there-is-more-than-one-kind-of-data-scientist/

Relationships - Four Categories and the Five Skill Groups
Source: http://datacommunitydc.org/blog/wp-content/uploads/2012/08/SkillsSelfIDMosaic-edit-500px.png

Data Science: Core Components
- Data Architecture (tool: Hadoop)
- Machine Learning (tool: Mahout)
- Analytics (tool: R)

Use-Cases

No one Knows How to Use it

Use-Case Implementation: Techniques Used
A Problem
Dataset
Analysis
Results

Use-Case Implementation: Process Flow Diagram
- Understanding the problem statement and defining the solution
- Understanding the Machine Learning algorithm to be used
- Implementing the Machine Learning algorithm in R on the smaller dataset
- Exploring ways to integrate R with Hadoop
- Implementing Machine Learning in Hadoop on Big Data
- Visualisation of the analysis

Media Use-Case
Domain of the Dataset:
Communications and Media. However, the application of the algorithm is not limited to Communications and Media. The technique is useful for any domain that requires organizing documents to improve retrieval and support browsing.
Problem Statement:
A top media company wants to browse through the popular news from a collection that appeared on the Reuters newswire in 1987.
Clustering (grouping) documents based on their contents will make the analysis easier.
Figure: The Reuters-21578 data set composition

Media Use-Case: K-means Clustering
- Dataset: Communications and Media dataset, to be clustered based on document contents
- Machine Learning Implementation: K-means clustering can be implemented on this dataset
- R Implementation: first we will understand the implementation of the technique in R on a smaller dataset
- Hadoop Implementation: then we will understand how to achieve document clustering on Big Data using Mahout libraries on Hadoop
- Result: content-wise clustered (grouped) documents
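To make the idea of clustering documents by content concrete, here is a minimal K-means sketch. It is written in plain Python rather than the R or Mahout tooling used in the course, and everything in it is an assumption made for illustration: the four toy documents, the word-count representation, the helper names, and the choice of documents 0 and 2 as starting centroids.

```python
from collections import Counter


def vectorize(docs):
    """Turn each document into a word-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return [[Counter(d.lower().split())[w] for w in vocab] for d in docs]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_idx, max_iter=20):
    """Plain K-means; init_idx picks the points used as starting centroids."""
    centroids = [list(points[i]) for i in init_idx]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no change, so the clustering has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical mini-corpus standing in for newswire stories.
docs = [
    "oil prices rise crude oil",
    "crude oil output cut",
    "wheat grain harvest",
    "grain wheat exports rise",
]
labels = kmeans(vectorize(docs), init_idx=[0, 2])
print(labels)
```

With this toy input the two oil stories end up in one cluster and the two grain stories in the other. A real run on Reuters-21578 would need proper text preprocessing and a more careful choice of k and of the starting centroids.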

Market Basket Use-Case
Domain of the Dataset:
Products and Retail. However, the application of the algorithm is not limited to Products and Retail. The technique can be applied wherever we want to discover the co-occurrence relationship amongst various activities.
Problem Statement:
Market Basket Analysis.
A retail outlet wants to understand the purchase behavior of a buyer. This information will enable the retailer to understand the buyer's needs.
The analysis might tell a retailer that customers often purchase shampoo and conditioner together, so putting both items on promotion at the same time would not create a significant increase in profit, while a promotion involving just one of the items would likely drive sales of the other.
Market Basket Analysis example: 98% of people who purchased items A and B also purchased item C.

Market Basket Use-Case: Association Rule Mining
- Dataset: Product and Retail dataset
- Machine Learning Implementation: the technique used is Affinity Analysis or Association Rule Mining
- R Implementation: understand the technique on a smaller dataset
- Hadoop Implementation: understand how to achieve the same on Big Data using Mahout libraries on Hadoop
- Result: Market Basket Analysis
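Association rules such as "people who buy A also buy B" are usually scored with support and confidence. The sketch below computes both measures for one rule in plain Python. The five baskets and the function names are made up for illustration and are not taken from the course dataset.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, items):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"shampoo", "conditioner", "soap"},
    {"shampoo", "conditioner"},
    {"shampoo", "soap"},
    {"conditioner", "shampoo", "towel"},
    {"soap", "towel"},
]

pair_support = support(baskets, {"shampoo", "conditioner"})
rule_confidence = confidence(baskets, {"shampoo"}, {"conditioner"})
print(pair_support, rule_confidence)
```

Here shampoo and conditioner appear together in 3 of 5 baskets, and 3 of the 4 shampoo baskets also contain conditioner. A full association rule miner would additionally enumerate candidate itemsets and filter them by minimum support, which this sketch leaves out.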

Health Care Use-Case
Domain of the Dataset:
Life Science and Health Care. However, the application of the algorithm is not limited to Life Science and Health Care. The technique can be applied wherever we want to forecast the occurrence of an event on the basis of certain conditions.
Problem Statement:
A health care organization wants to forecast the onset of diabetes mellitus in Indians using a certain set of patient attributes as input, such as:
- Plasma glucose concentration
- Diastolic blood pressure
- Triceps skin fold thickness
- etc.
Source: http://www.thenewstribe.com/2013/11/15/diabetes-is-killing-one-patient-every-six-seconds/

Health Care Use-Case: Parallel Processing
- Dataset: Life Science and Health Care dataset with some attributes of patients as input
- Machine Learning Implementation: the technique used is Affinity Analysis or Association Rule Mining
- R Implementation: understand the basic implementation of the technique on a smaller dataset using R; achieve parallel processing on the same algorithm using a parallel processing library provided by Revolution R
- Hadoop Implementation: understand how to achieve the same on Big Data using Mahout libraries on Hadoop
- Result: forecast the onset of diabetes mellitus in Indians

Social Media Use-Case
Domain of the Dataset:
Social Media. However, the application of the algorithm is not limited to Social Media. The technique can be applied wherever we want to put documents into categories without going through the contents of all the documents.
Problem Statement:
A Social Media research firm wants to know the trends of topics discussed on Twitter. For easy analysis it wants to classify them into the following categories:
- apparel (clothes, shoes, watches, …)
- art (book, DVD, music, …)
- camera
- event (travel, concert, …)
- health (beauty, spa, …)
- home (kitchen, furniture, garden, …)
- tech (computer, laptop, tablet, …)
Source: http://www.mobigyaan.com/images/stories/Miscellaneous/mobigyaan-twitter-chat.jpg

Social Media Use-Case: Naïve Bayes Classifier
- Dataset: Social Media dataset
- Machine Learning Implementation: the technique used is the Naïve Bayes Classifier
- R Implementation: understand the basic technique on a smaller dataset using R
- Hadoop Implementation: understand how to achieve the same on Big Data using Mahout libraries on Hadoop
- Result: categorical classification of the tweets
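As a rough illustration of how a Naïve Bayes classifier assigns a tweet to a category, the following plain-Python sketch trains a multinomial model with add-one (Laplace) smoothing on four made-up tweets. The training data, the function names and the two-category setup are assumptions for this example only; the course itself uses R and Mahout.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label) pairs. Returns the fitted counts."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_docs, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log-posterior score."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            p = (word_counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled tweets.
tweets = [
    ("new laptop and tablet deals", "tech"),
    ("my laptop battery died", "tech"),
    ("spa day and beauty tips", "health"),
    ("beauty sleep and spa weekend", "health"),
]
model = train(tweets)
print(classify("cheap tablet or laptop", model))
print(classify("relaxing spa weekend", model))
```

On this toy data the first query is scored as tech and the second as health. The "naïve" part is the assumption that words are independent given the class, which is why the per-word log-probabilities can simply be added.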

Data Science: Core Components
Going forward in this class, we will shed some light on the concepts of Hadoop, R and Machine Learning, in that order.
These topics will be covered in detail in their respective modules during the course.

Introduction to Hadoop

- Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
- It is an open-source data management framework with scale-out storage and distributed processing.
- In 2004, Google published a paper on a process called MapReduce.
- The MapReduce framework provides a parallel processing model and an associated implementation to process huge amounts of data.
- An implementation of the MapReduce framework was subsequently adopted by an Apache open source project named Hadoop.
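The MapReduce programming model mentioned above can be sketched in a few lines. The example below is a single-process word count in plain Python, not Hadoop code: the map, shuffle and reduce function names are illustrative assumptions, and a real Hadoop job would distribute each phase across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data is big", "data science uses data"])
print(counts)
```

Because each map call looks at only one line and each reduce call at only one key, the phases can run in parallel on different machines, which is the property Hadoop exploits.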

Hadoop Key Characteristics
- Scalable
- Reliable
- Economical
- Flexible
- Robust Ecosystem

Hadoop Core Components
- HDFS Cluster: Name Node (on the Admin Node) and Data Nodes
- MapReduce Engine: Job Tracker (on the Admin Node) and Task Trackers
- Each of the four worker nodes shown runs a Data Node and a Task Tracker

Annie’s Question
Hadoop is a framework that allows for the distributed processing of:
- Small Data Sets
- Large Data Sets

Annie’s Answer
Large Data Sets. Hadoop can also process small data sets; however, to experience its true power one needs data in the terabyte range, because this is where an RDBMS can take hours or fail, whereas Hadoop can do the same work in a couple of minutes.

To set up Hadoop on your system, follow the “Hadoop Installation Guide” in the LMS.

Analytics with R

Analytics with R
Source: http://www.r-project.org/

R: Characteristics
- R is open source and free.
- R has lots of packages and multiple ways of doing the same thing.
- By default, R stores data in memory (RAM).
- R has very advanced graphics, but you need much better programming skills to use them.
- R has GUIs that help make learning easier.
- Customization needs the command line.
- R can connect to many databases and data types.

Comparing R and others
Source: http://r4stats.com/articles/popularity/

Comparing R with Base SAS* / SAS/Stat*
R | Base SAS* / SAS/Stat*
R is open source and free. | Base SAS*, SAS/Stat*, SAS/ETS*, SAS/OR* and SAS/Graph* are relatively expensive because of annual licenses.
Open source R has support from email lists, Twitter and Stack Overflow. | SAS Institute* products have dedicated support and extensive documentation.
R is slower on the desktop than Base SAS for datasets of ~4-5 GB. | By default R stores data in RAM, so we can use the cloud.
R has much better graphics. | You need much better programming skills.
You can create custom functions in R easily. | Customization needs the command line.
R has multiple GUIs that are free. | SAS GUIs are more expensive.
*Copyright © 2012 SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513, USA. All rights reserved.

Annie’s Question
R Provides support in terms of:
1. Dedicated Support and Documentation
2. Email-lists, twitter, etc.

Annie’s Answer
Answer:
2. Email-lists, twitter, etc.

Annie’s Question
Custom functions can be easily created in :
1. SAS
2. R

Annie’s Answer
Answer:
2. R

Annie’s Question
Most of the functions in R are written in :
- Java
- R
- C
- Fortran

Annie’s Answer
Most of the user-visible functions in R are written in R.
It is possible for the user to interface to procedures written in the C, C++, or
FORTRAN languages for efficiency.

Introduction to R Programming Language
- History
- Evolution
- Current State
- Open Source
- Free
- Widely Recognized
- Official Website
- R Core
- Creators
- R Journal
Source: www.r-project.org/about.html

R and Hadoop Integration
- R and Hadoop are a natural match in Big Data analytics and visualization.
- One of the most well-known R packages to support Hadoop functionality is RHadoop.
- RHadoop was developed by Revolution Analytics.
- RHadoop is a collection of three R packages: rmr, rhdfs and rhbase.
- The rmr package provides Hadoop MapReduce functionality in R, rhdfs provides HDFS file management in R, and rhbase provides HBase database management from within R.

To set up R on your system, follow the “R Installation Guide” in the LMS under Module 1.

Machine Learning

Machine Learning: Mahout
- Machine Learning is a class of algorithms that is data-driven: unlike "normal" algorithms, it is the data that "tells" what the "good answer" is.
Example:
A hypothetical non-machine-learning algorithm for face recognition in images would try to define what a face is (a round, skin-colored disk with dark areas where you expect the eyes, etc.).
A machine learning algorithm would not have such a coded definition, but would "learn by examples": you show it several images of faces and non-faces, and a good algorithm will eventually learn to predict whether or not an unseen image is a face.
Source: http://endthelie.com/2012/08/24/fbi-sharing-facial-recognition-software-with-police-departments-across-america/
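The contrast between a hand-coded rule and learning by examples can be sketched with toy numbers. In the Python example below, each "image" is reduced to two made-up features (a roundness score and a count of dark spots); the features, thresholds, function names and training examples are all hypothetical, and the learner is a simple 1-nearest-neighbour classifier chosen only because it is short.

```python
def hand_coded_is_face(roundness, dark_spots):
    """Non-ML approach: the programmer writes the definition of a face."""
    return roundness > 0.8 and dark_spots == 2


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor(examples, query):
    """ML approach: copy the label of the most similar labelled example."""
    _, label = min(examples, key=lambda ex: sq_dist(ex[0], query))
    return label


# Hypothetical labelled examples: (roundness, dark_spots) -> label.
examples = [
    ((0.9, 2.0), "face"),
    ((0.7, 2.0), "face"),
    ((0.2, 0.0), "not-face"),
    ((0.3, 5.0), "not-face"),
]

unseen = (0.75, 2.0)
print(hand_coded_is_face(*unseen))         # the fixed rule rejects it
print(nearest_neighbor(examples, unseen))  # the learner follows the data
```

The hand-written rule fails on the slightly less round face because its threshold was fixed in advance, while the example-driven classifier labels it by similarity to what it has been shown. Changing the behaviour of the second approach means supplying different examples, not rewriting code.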

Mahout Overview
- Machine Learning is all over the web today
- Mahout is about scalable Machine Learning
- Mahout has functionality for many of today’s common machine learning tasks
- Hadoop and MapReduce magic in action

Machine Learning: LinkedIn Recommendations
Write intelligent applications using Apache Mahout.
Example: LinkedIn Recommendations
Source: https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

Annie’s Question
Mahout Algorithms for clustering, classification and collaborative filtering are
implemented on top of Apache Hadoop using :
- Flume
- MapReduce
- Sqoop
- Hive

Annie’s Answer
Mahout Algorithms are implemented on top of Apache Hadoop using the
Map/Reduce paradigm.

Assignment
1. Install R with the help of the “R Installation Steps” guide in the LMS. This step-by-step guide will help you install and set up R on your system.

Agenda for Next Class
In the next class you will be able to:
- Understand what R is
- Describe why R is used
- Implement R programming concepts
- Learn data import techniques
- Analyze the processing of data

Pre-work
Go through the “R Essentials for Data Science” section in the LMS. Watch the recordings present in the
section to gain an understanding of the R environment.

What’s Within the LMS?
- Recording of the Class
- Presentation
- Quiz
- Assignment
- Installation Guide
- Pre-work

References
http://www.today.mccombs.utexas.edu/2012/04/the-big-data-machine
http://www.espncricinfo.com/
http://www.majorprojects.vic.gov.au/our-projects/our-past-projects/austin-hospital
http://wp.streetwise.co/wp-content/uploads/2012/08/Amazon-Recommendations.png
http://smhttp.23575.nexcesscdn.net/80ABE1/sbmedia/blog/wp-content/uploads/2013/03/netflix-in-asia.png
http://www.crowdsourcing.org/article/-nasa-tries-to-free-creativity-with-big-data-challenge/19984
http://whatsthebigdata.files.wordpress.com/2013/11/batman-on-big-data.jpg
http://spinnakr.com/blog/wp-content/uploads/2013/08/Using-Big-Data-.jpg
http://thesocietypages.org/sociologylens/files/2013/09/BIgDataDilbert_Cartoon.jpg
http://abstrusegoose.com/55
http://www.thenewstribe.com/2013/11/15/diabetes-is-killing-one-patient-every-six-seconds/
http://www.mobigyaan.com/images/stories/Miscellaneous/mobigyaan-twitter-chat.jpg
http://www.r-project.org/
http://endthelie.com/2012/08/24/fbi-sharing-facial-recognition-software-with-police-departments-across-america/
