Big Data with R
Big Data refers to large volumes of data, which may be structured or unstructured. Large
organizations and businesses rely on this data for insights that help them anticipate future
trends. Big Data is commonly defined in terms of the 3Vs, which are as follows:
Volume – Volume refers to the quantity of data, which is increasing day by day. Facebook, for
example, has more users than the entire population of China, and the data it holds is
correspondingly huge. That data takes the form of images, music, videos, and more.
Velocity – Velocity refers to the rate at which data is generated. Taking Facebook as the example
again, a huge amount of data is uploaded and shared every second. People on social media want
new information and content each time they log in; old news and information does not matter to
them, so new information is shared every second.
Variety – Variety, the third V of Big Data, refers to the diversity of data types. Data can be
stored in many formats, such as images, video, text, PDF, or Excel files. Managing these different
types of data is a major challenge: an organization needs to group data of similar formats
together in order to extract important information from it.
Why is Big Data Analytics important?
Big Data and its analytics are important for the following reasons:
• Reduction in cost – Big Data analytics offers cost advantages through technologies like
Hadoop and cloud computing. These technologies help in storing and managing large
amounts of data.
• Better decision making – Using Hadoop and analytics, organizations and businesses can
make better and faster decisions by analyzing different sources of data.
• New services and product development – With the help of Big Data analytics,
companies can measure customer behavior and needs. Using these measurements, they
can launch new products and services that satisfy user needs.
R Programming Language
R is an open-source programming language and software environment for statistical computing,
graphics, and reporting. It is used extensively by statisticians and data miners for data analysis
and for developing statistical software. Robert Gentleman and Ross Ihaka are the two authors of
the language, which is named 'R' after the first letter of their first names.
The source code of the R software environment is written mainly in C, Fortran, and R itself. R is
a GNU package and is freely available under the GNU General Public License.
What is GNU?
GNU is a recursive acronym for "GNU's Not Unix!" It is an operating system and a collection of
computer software. Its design is Unix-like, but it differs from Unix in that it is free software
and does not contain any Unix code.
Features of R
The R programming language has the following main features:
• It is a simple and well-defined programming language that includes conditionals, loops,
and recursive functions.
• It has data handling and data storage facilities.
• It provides operators for calculations on arrays, matrices, and vectors.
• It provides an integrated set of tools for data analysis.
• It provides static graphics out of the box, and dynamic and interactive graphs can be
produced with additional packages.
Basic Syntax of R
To work with R, you first need to set up the R environment. Once it is set up, you are ready to
work at the R command prompt. To start the R command prompt, type the following command in
your shell:
$ R
This launches the R interpreter, where you type your program at the > prompt as follows:
> Mystring <- "Hello World!"
> print(Mystring)
[1] "Hello World!"
R Script File
Programs are usually written in script files and then executed from the command prompt using
the R script interpreter, called Rscript.
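As a minimal sketch of such a script file, assume the code below is saved under the hypothetical name mean_args.R. The file name and the helper function mean_of_args are illustrative and not part of the original text; commandArgs is the standard base R function for reading command-line arguments.

```r
# mean_args.R (hypothetical file name)
# Run from the shell with, for example:  Rscript mean_args.R 1 2 3

# Illustrative helper: convert character arguments to numbers and average them.
mean_of_args <- function(args) {
  mean(as.numeric(args))
}

# Arguments given after the script name on the command line.
args <- commandArgs(trailingOnly = TRUE)

if (length(args) > 0) {
  cat("Mean of arguments:", mean_of_args(args), "\n")
} else {
  cat("No arguments supplied.\n")
}
```

Keeping the logic in a small function makes the same code usable both from Rscript and from the interactive prompt.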
In R, variables are assigned R objects, which are of the following types:
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
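The object types listed above can be sketched in a few lines of base R. All variable names and values here are made up for illustration.

```r
# Vector: an ordered collection of values of a single type.
v <- c(1.5, 2.5, 4.0)

# List: an ordered collection whose elements may have different types.
l <- list(label = "example", count = 3L, tags = c("a", "b"))

# Matrix: a two-dimensional structure holding a single type.
m <- matrix(1:6, nrow = 2, ncol = 3)

# Array: like a matrix, but with any number of dimensions.
a <- array(1:24, dim = c(2, 3, 4))

# Factor: categorical data with a fixed set of levels.
f <- factor(c("low", "high", "low", "medium"),
            levels = c("low", "medium", "high"))

# Data frame: a table whose columns are equal-length vectors.
df <- data.frame(id = 1:3, score = v)

str(df)  # compact summary of the data frame's structure
```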
Working with Big Data in R
R has been around for more than 20 years, but it has gained attention recently because of its
capacity to handle Big Data. R provides a series of packages and an environment for statistical
computation on Big Data. The Programming with Big Data in R (pbdR) project was developed a
few years ago and is mainly used for data profiling and distributed computing. R packages and
functions are available to load data from a wide range of sources.
Hadoop is a Big Data technology for handling large amounts of data. R and Hadoop can be
integrated for Big Data analytics.
Why integrate R with Hadoop?
R is a very good programming language for statistical data analysis and for turning that
analysis into interactive graphs. Although R is a preferred language for statistics and analysis,
it has some drawbacks. In R, all objects are held in the main memory of a single machine, so
datasets larger than the available RAM cannot be loaded. As a result, R on its own does not scale
well, and only a limited amount of data can be processed at a time. This is where Hadoop comes
in.
Hadoop is a distributed processing framework for performing operations on large datasets. It is
already a popular framework for Big Data processing, and integrating it with R combines the
strengths of both. The integration makes data analytics highly scalable, so that the analytics
platform can be scaled up or down depending on the size of the datasets. It can also be
cost-effective.
How to integrate R with Hadoop?
Data scientists use R packages and R scripts for data processing. To run on Hadoop directly,
these packages and scripts would have to be rewritten in Java or another language that
implements Hadoop MapReduce. What is needed instead is software written in R that can work
with data stored in Hadoop's distributed storage. The following are some of the methods for
integrating R with Hadoop:
1. RHadoop – This is the most commonly used solution for integrating R with Hadoop. It
allows users to take data directly from HBase database systems and HDFS file systems. It
also offers the advantages of simplicity and low cost. It is a collection of 5 packages for
managing and analyzing data using R:
• rhbase – Provides database management functions for HBase from within R.
• rhdfs – Provides connectivity to the Hadoop Distributed File System (HDFS).
• plyrmr – Provides data manipulation operations on large datasets.
• ravro – Allows users to read and write Avro files from HDFS.
• rmr2 – Used to perform statistical analysis on data stored in Hadoop by writing
MapReduce jobs in R.
2. RHIPE – This is an acronym for R and Hadoop Integrated Programming Environment. It is
an R library that lets users run MapReduce from within R. It provides a data distribution
scheme and integrates well with Hadoop.
3. R and Hadoop Streaming – Hadoop Streaming makes it possible to run MapReduce using
any executable script. Acting as a mapper or reducer, the script reads data from standard
input and writes its results to standard output. R scripts can be used with Hadoop
Streaming in this way.
4. RHive – This is based on installing R on workstations and connecting to data in Hadoop.
RHive is a package for launching Hive queries. It has functions to retrieve metadata from
Apache Hive, such as database names, table names, and column names. RHive also makes
R libraries and algorithms available to the data stored in Hadoop. Its main advantage is
the parallelization of operations.
5. ORCH – This is an acronym for Oracle R Connector for Hadoop. It allows users to test
MapReduce programs without needing to learn a new programming language.
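To make the Hadoop Streaming approach from the list above more concrete, here is a minimal word-count sketch in base R. The function names map_lines and reduce_lines are hypothetical and chosen for illustration. The sketch assumes the usual streaming convention of one tab-separated key and value per line, and it does not use any of the RHadoop packages.

```r
# Mapper logic: emit "word<TAB>1" for every word found in the input lines.
map_lines <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]          # drop empty strings left by the split
  if (length(words) == 0) return(character(0))
  paste(words, 1, sep = "\t")
}

# Reducer logic: sum the counts for each word in "word<TAB>count" lines.
reduce_lines <- function(lines) {
  if (length(lines) == 0) return(character(0))
  parts  <- strsplit(lines, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  counts <- as.integer(vapply(parts, `[`, character(1), 2))
  totals <- tapply(counts, keys, sum)    # one total per distinct word
  paste(names(totals), totals, sep = "\t")
}

# In an actual streaming job, the mapper and reducer would be two separate
# scripts, each ending with a line like the ones below (left commented out
# here so that the sketch does not wait for standard input):
#   writeLines(map_lines(readLines(file("stdin"))))
#   writeLines(reduce_lines(readLines(file("stdin"))))

# Local check without Hadoop: chain the two steps on a small input.
print(reduce_lines(map_lines("Big data, big R")))
```

Writing the logic as plain functions over character vectors means it can be tried on a laptop before being submitted to a cluster.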
Considering all this, the combination of R and Hadoop is well suited to working with Big Data
when faster, better, and predictive analytics are needed along with performance, scalability, and
flexibility.
Strategies of Big Data in R
Big Data can be tackled in R with the following strategies:
• Sampling – If the data is too big to be analyzed, its size can be reduced using sampling.
Sampling can, however, reduce the quality of the results in some cases.
• Bigger hardware – R keeps all objects in the main memory of a single machine, which
becomes a problem when the data is very large. Increasing the machine's memory lets
larger datasets be handled.
• Storing objects on hard drive – Instead of storing data in memory, data objects can be
stored on the hard disk using packages designed for this. The data can then be analyzed
block-wise, which also lends itself to parallelization. This works only with algorithms
that are specifically designed for block-wise processing. 'ff' and 'ffbase' are the main
packages for this purpose.
• Integration of high-performing programming languages – For better performance,
high-performing programming languages can be integrated with R. Only small,
performance-critical components of the program are moved from R to the other language,
which limits the risk of the change. To implement this strategy, developers need to be
proficient in another programming language such as Java or C++.
• Alternative interpreters – To deal with Big Data, alternative interpreters can be used.
One such interpreter is pqR (pretty quick R). Another alternative is Renjin, which runs
on the JVM (Java Virtual Machine).
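Two of the strategies above, sampling and block-wise analysis of data kept on disk, can be sketched in base R. This is an illustration under stated assumptions, not the 'ff' or 'ffbase' API: the data is synthetic, the file is a temporary CSV, and the helper chunked_mean is a hypothetical name.

```r
set.seed(42)  # arbitrary seed so the example is reproducible

# Synthetic stand-in for a large dataset.
big <- data.frame(id = 1:50000, value = rnorm(50000))

# Sampling: analyze a simple random sample of 1,000 rows instead of all rows.
idx <- sample(nrow(big), size = 1000)
sample_mean <- mean(big$value[idx])

# Block-wise processing: keep the data on disk and read it in chunks,
# accumulating a running sum so that only one chunk is in memory at a time.
path <- tempfile(fileext = ".csv")
write.csv(big, path, row.names = FALSE)

chunked_mean <- function(path, column, chunk_rows = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # The first line holds the quoted column names written by write.csv.
  header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break        # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

cat("Sample mean: ", sample_mean, "\n")
cat("Chunked mean:", chunked_mean(path, "value"), "\n")
cat("Full mean:   ", mean(big$value), "\n")
```

The mean works block-wise because it can be built from running sums; algorithms without such a decomposition need a purpose-built variant, as the text notes.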

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 

Big Data - Analytics with R

  • 1. Big Data with R
Big Data refers to large volumes of data, which may be structured or unstructured. This data is essential for large organizations and businesses, which analyze it for insights into future trends. Big Data is defined in terms of 3 Vs, which are as follows:
Volume – Volume refers to the quantity of data, which is increasing day by day. Facebook has more users than the entire population of China, and its data is correspondingly huge. The data takes the form of images, music, videos and similar content.
Velocity – Velocity refers to the rate at which data is generated. Taking Facebook as an example again, a huge amount of data is uploaded and shared on it every second. People on social media want new information and content each time they log in; old news and information does not matter to them. As a result, new information is shared on social media every second.
Variety – Variety, the third V of Big Data, means diverse types of data. Data can be stored in multiple formats, such as images, video, text, PDF or Excel files. Managing these different types of data is a major challenge for Big Data. An organization needs to group data of similar formats together in order to extract important information from it.
Why is Big Data Analytics important?
Big Data and its analytics are important for the following reasons:
 Reduction in Cost – Big Data Analytics offers cost advantages through technologies like Hadoop and Cloud Computing. These technologies help in storing and managing large amounts of data.
 Better Decision making – Using Hadoop and analytics, organizations and businesses are able to make better and faster decisions by analyzing different sources of data.
  • 2.  New services and product development – With the help of big data analytics, companies can measure customer behavior and needs. Using these measurements, companies can launch new products and services that satisfy user needs.
R Programming Language
R is an open-source programming language and software environment for statistical computing, graphical representation, and reporting. The R language is used extensively by statisticians and data miners for data analysis and for developing statistical software. Robert Gentleman and Ross Ihaka are the two authors of the language, which is named 'R' after the first letter of their first names. The source code of the R software environment is written mainly in C, FORTRAN, and R. R is a GNU package and is freely available under the GNU General Public License.
What is GNU?
GNU is a recursive acronym for "GNU's Not Unix!" It is an operating system and a collection of computer software. Its design is like UNIX, but it differs from UNIX in that it is free software and does not contain any UNIX code.
Features of R
The R programming language has the following main features:
 It is a simple and well-defined programming language that includes conditions, loops, and recursive functions.
 It has data handling and data storage facilities.
 It provides operators for calculations on arrays, matrices and vectors.
 It provides an integrated set of tools for data analysis.
 It provides graphical facilities for producing static as well as dynamic and interactive graphs.
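The language features listed above can be illustrated with a short sketch. The function and variable names (factorial_rec, even_sum, v, m) are made up for this example and do not come from the slides.

```r
# Recursive function: factorial of a non-negative integer
factorial_rec <- function(n) {
  if (n <= 1) return(1)
  n * factorial_rec(n - 1)
}

# Loop with a condition: sum the even numbers from 1 to 10
even_sum <- 0
for (i in 1:10) {
  if (i %% 2 == 0) even_sum <- even_sum + i
}

# Vector arithmetic is element-wise; %*% is matrix multiplication
v <- c(1, 2, 3)
m <- matrix(1:4, nrow = 2)  # filled column by column: rows are (1, 3) and (2, 4)
mm <- m %*% m

print(factorial_rec(5))  # 120
print(even_sum)          # 30
print(v * 2)             # 2 4 6
print(mm)                # rows (7, 15) and (10, 22)
```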
  • 3. Basic Syntax of R
To work with R, you first need to set up the R environment. Once it is set up, you are ready to work at the R command prompt. To start the R command prompt, type the following command:
$ R
This launches the R interpreter, where you type your program at the > prompt as follows:
> Mystring <- "Hello World!"
> print(Mystring)
[1] "Hello World!"
Note that R is case-sensitive, so the function is print, not Print, and strings use straight quotes.
R Script File
Programs are usually written in script files and then executed from the command prompt using the R interpreter called Rscript.
In R, variables are assigned R objects, which are of the following types:
 Vectors
 Lists
 Matrices
 Arrays
 Factors
 Data Frames
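As a minimal sketch, each of the R object types listed above can be created as follows. The variable names and values are illustrative only.

```r
v  <- c(10, 20, 30)                         # vector: elements of one type
l  <- list(name = "sample", values = v)     # list: elements of mixed types
m  <- matrix(1:6, nrow = 2, ncol = 3)       # matrix: two-dimensional
a  <- array(1:24, dim = c(2, 3, 4))         # array: any number of dimensions
f  <- factor(c("low", "high", "low"))       # factor: categorical values
df <- data.frame(id = 1:3, score = v)       # data frame: table of columns

# Inspect the structure of each object
str(v)
str(l)
str(m)
str(a)
str(f)
str(df)
```

Saved to a file, such a script could be run from the shell with Rscript followed by the file name.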
  • 4. Working with Big Data in R
R has been around for more than two decades, but it has gained attention recently because of its capacity to handle Big Data. R provides a series of packages and an environment for statistical computation on Big Data. The Programming with Big Data in R project (pbdR) was developed a few years ago and is mainly used for data profiling and distributed computing. R packages and functions are available to load data from a wide range of sources.
Hadoop is a Big Data technology for handling large amounts of data. R and Hadoop can be integrated for Big Data analytics.
Why integrate R with Hadoop?
R is a very good programming language for statistical data analysis and for turning that analysis into interactive graphs. Although R is a preferred language for statistics and analysis, it also has some drawbacks. R keeps all objects in the main memory of a single machine, so data that is too large to fit into RAM cannot be loaded. R also does not scale on its own, which means only a limited amount of data can be processed at a time. In this case, Hadoop is a good fit. Hadoop is a distributed processing framework for performing operations on large datasets. It is already a popular framework for Big Data processing, and integrating it with R combines the strengths of both. This makes data analytics highly scalable: the analytics platform can be scaled up or down depending on the datasets. It can also improve the return on cost.
How to integrate R with Hadoop?
Data scientists use R packages and R scripts for data processing. To run on Hadoop directly, these packages and scripts would need to be rewritten in Java or another programming language that implements the Hadoop MapReduce algorithm. What is needed instead is software written in R that works with data stored on Hadoop's distributed storage.
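To make the MapReduce model mentioned above more concrete, the following sketch simulates its three phases (map, shuffle, reduce) locally in base R on a tiny word-count problem. It does not use Hadoop or any Hadoop package; the input lines and variable names are invented for illustration, and a real job would distribute each phase across a cluster.

```r
# Input: lines of text, standing in for records stored on a cluster
lines <- c("big data with r", "r and hadoop", "big data analytics")

# Map: emit one (word, 1) pair per word
words <- unlist(strsplit(lines, " ", fixed = TRUE))
pairs <- data.frame(key = words, value = 1L, stringsAsFactors = FALSE)

# Shuffle: group the emitted values by key
grouped <- split(pairs$value, pairs$key)

# Reduce: sum the values for each key
counts <- vapply(grouped, sum, integer(1))

print(counts)
```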
Following are some of the methods to integrate R with Hadoop:
1. RHADOOP – It is the most commonly used solution for integrating R with Hadoop. This analytics solution allows users to take data directly from HBase database systems and HDFS file systems.
  • 5. It also offers the advantages of simplicity and low cost. It is a collection of 5 packages for managing and analyzing data using the R programming language. The 5 packages are as follows:
 rhbase – This provides database management functions for HBase within R.
 rhdfs – This package provides connectivity to the Hadoop Distributed File System (HDFS).
 plyrmr – This package provides data manipulation operations on large datasets.
 ravro – This allows users to read and write Avro files from HDFS.
 rmr2 – This is used to perform statistical analysis on data stored in Hadoop by writing MapReduce jobs in R.
2. RHIPE – It is an acronym for R and Hadoop Integrated Programming Environment. It is an R library that lets users run MapReduce jobs from within R. It provides a data distribution scheme and integrates well with Hadoop.
3. R and Hadoop Streaming – Hadoop Streaming makes it possible to run MapReduce using any executable script. The script reads data from standard input and writes its results to standard output, acting as a mapper or a reducer. R scripts can therefore be used with Hadoop Streaming.
4. RHIVE – It is based on installing R on workstations and connecting to data in Hadoop. RHIVE is the package for launching Hive queries from R. It has functions to retrieve metadata from Apache Hive, such as database names, table names, and column names. RHIVE also makes R libraries and algorithms available for data stored in Hadoop. Its main advantage is the parallelization of operations.
5. ORCH – It is an acronym for Oracle R Connector for Hadoop. It allows users to test a MapReduce program without needing to learn a new programming language.
Considering all this, the combination of R and Hadoop is well suited to Big Data work that calls for faster, better, and predictive analytics along with performance, scalability and flexibility.
Strategies of Big Data in R
Big Data can be tackled in R with the following strategies:
  • 6.  Sampling – If the data is too big to be analyzed, its size can be reduced using sampling. However, sampling can reduce the quality of the results in some cases.
 Bigger Hardware – R keeps all objects in main memory, so problems occur if the data is very large. To resolve this, the machine's memory can be increased so that larger data fits.
 Storing objects on hard drive – Instead of being held in memory, data objects can be stored on the hard disk using packages that are available for this purpose. The data can then be analyzed block by block, which also lends itself to parallelization. This works only with algorithms that are specifically designed for block-wise processing. 'ff' and 'ffbase' are the main packages for this purpose.
 Integration of high performing programming languages – For better performance, higher-performing programming languages can be integrated with R. Only small components of the program are moved from R to the other language, which limits the risk involved. To implement this strategy, developers need to be proficient in another programming language such as Java or C++.
 Alternative Interpreters – To deal with Big Data, alternative interpreters can be used. One such interpreter is pqR (pretty quick R). Another alternative is Renjin, which runs on the JVM (Java Virtual Machine).
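The sampling and block-wise strategies above can be sketched in base R. This example deliberately uses plain file connections rather than the 'ff' or 'ffbase' packages, so it only approximates the idea of keeping data on disk; the temporary file, chunk size and variable names are assumptions made for illustration.

```r
set.seed(42)  # fixed seed so the random sample is reproducible

# Write a hypothetical data file (the numbers 1 to 10000) to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:10000), path)

# Strategy 1: sampling. Estimate the mean from a random 10% of the values.
# (For brevity the full vector is loaded here before sampling.)
all_values <- as.numeric(readLines(path))
sample_mean <- mean(sample(all_values, size = 1000))

# Strategy 2: block-wise processing. Read 1000 lines at a time and
# keep only a running sum and a running count in memory.
con <- file(path, open = "r")
total <- 0
n <- 0
repeat {
  chunk <- readLines(con, n = 1000)
  if (length(chunk) == 0) break   # end of file reached
  x <- as.numeric(chunk)
  total <- total + sum(x)
  n <- n + length(x)
}
close(con)
chunk_mean <- total / n

print(sample_mean)  # approximate mean from the sample
print(chunk_mean)   # exact mean, computed block by block
unlink(path)        # remove the temporary file
```

The block-wise loop works here because a mean can be computed from partial sums; as the text notes, not every algorithm can be split into blocks this way.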