Big Data Analytics in Manufacturing
Faizan Cassim and Carl Berry

Scope of the Research Internship
A C# application was developed to generate a large number of emails, each with a random sender, recipient, topic and email body, together with a keyword embedded in the body and a punctuation mark. Batches of emails were combined into single CSV files, producing several such files. Experimenting with Big Data did not require data files as large as those used in real-life deployments.
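A minimal sketch of such a generator is shown below; the sender list, keyword list and CSV layout are illustrative assumptions rather than the actual internship code.

```csharp
// Illustrative sketch only: writes randomly generated "emails" to a CSV file.
// Sender list, keywords and column layout are assumptions, not the original application.
using System;
using System.IO;

class EmailGenerator
{
    static readonly string[] Senders = { "alice@example.com", "bob@example.com", "carol@example.com" };
    static readonly string[] Topics = { "Invoice", "Meeting", "Delivery", "Order" };
    static readonly string[] Keywords = { "urgent", "shipment", "payment", "schedule" };
    static readonly string[] Punctuation = { ".", "!", "?" };
    static readonly Random Rng = new Random();

    static string Pick(string[] options) => options[Rng.Next(options.Length)];

    static void Main()
    {
        using (var writer = new StreamWriter("emails_batch_001.csv"))
        {
            writer.WriteLine("sender,recipient,topic,body,keyword,punctuation");
            for (int i = 0; i < 100000; i++)
            {
                string keyword = Pick(Keywords);
                string punct = Pick(Punctuation);
                string body = $"Please review the {keyword} details before Friday{punct}";
                writer.WriteLine(string.Join(",",
                    Pick(Senders), Pick(Senders), Pick(Topics),
                    "\"" + body + "\"", keyword, "\"" + punct + "\""));
            }
        }
    }
}
```

Several such CSV files, each holding a batch of generated emails, provide a data set large enough to exercise the Hadoop tooling without approaching production scale.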
The second part of the internship was about understanding how Big Data works. The Hortonworks Hadoop platform was used, since it is the core framework of Microsoft's Big Data implementation in Azure (HDInsight). After completing the official Hortonworks suite of tutorials, the user is able to complete basic tasks such as importing data, running queries and scripting iterative code using Pig, as well as use some of the functions of Hadoop's command-line interface.
The third part of the internship looked into the application aspects of Big Data. A set of community tutorials is available to complement the official suite; these look at more advanced topics, such as reading data from various sources, including social networks. Pig, Hive and MapReduce were covered in detail. The final part of the internship was about analyzing and designing MapReduce programs; MapReduce implementations in Java, C# and Pig were looked into.
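As an illustration of the kind of MapReduce logic that was studied, the sketch below counts keyword occurrences in the generated email CSVs. It is written in the Hadoop Streaming style (plain console programs that read stdin and write tab-separated key/value pairs to stdout), which is one way of running C# MapReduce jobs; it is a simplified, assumed example rather than the HDInsight SDK code used during the internship, and the CSV column positions are assumptions.

```csharp
// Streaming-style keyword count: the mapper emits "keyword\t1" per CSV row and the
// reducer sums the counts per keyword. Hadoop handles splitting, shuffling and sorting.
using System;
using System.Collections.Generic;

class KeywordCount
{
    // Mapper: reads CSV lines from stdin and emits one key/value pair per line.
    static void Map()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var fields = line.Split(',');
            if (fields.Length > 4 && fields[4] != "keyword")   // skip header; keyword assumed in column 5
                Console.WriteLine(fields[4] + "\t1");
        }
    }

    // Reducer: receives the sorted "keyword\t1" pairs and aggregates them.
    static void Reduce()
    {
        var counts = new Dictionary<string, int>();
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var parts = line.Split('\t');
            counts.TryGetValue(parts[0], out int n);
            counts[parts[0]] = n + int.Parse(parts[1]);
        }
        foreach (var kv in counts)
            Console.WriteLine(kv.Key + "\t" + kv.Value);
    }

    static void Main(string[] args)
    {
        if (args.Length > 0 && args[0] == "reduce") Reduce();
        else Map();
    }
}
```

Run as a streaming job, Hadoop invokes the mapper over input splits stored in HDFS, sorts the emitted keys during the shuffle phase, and passes the grouped pairs to the reducer; the equivalent logic in Pig or HiveQL reduces to a single GROUP BY over the keyword column.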
Conclusion

• HiveQL and Pig are limited in their functionality compared to a traditional MapReduce program.
• MapReduce programs can be written in C#, provided that the user runs Microsoft's instance of Hadoop (HDInsight).
• Third-party software such as Spring XD can be used to import social network data into Hadoop.
• Specialist hardware or software is not needed to run a Hadoop implementation.
• A Hortonworks (or HDInsight) user can export data into Microsoft Excel 2013 for further analysis.
• Some instances of Hadoop provide visualization tools that help the user make better sense of the data. Where this is not the case, the user can opt to use third-party tools, in any instance of Hadoop, to do the same.
What is Big Data?
According to Gartner's 3V model [1], Big Data can be defined as having Volume, Velocity and/or Variety: volume represents the large scale of the data (typically ranging into terabytes), velocity represents the rate at which data is produced (the computer networks at CERN process 10 GB of data every second), and variety represents the diversity of the data, which can be an image, a video or a piece of text. Social networks are well known for variety, as data about people often comes in the form of images and videos as well as text. Another important characteristic of Big Data is the need to handle semi-structured or unstructured data, which makes traditional RDBMS methods inefficient for the job. Big Data platforms can run on commodity hardware over ordinary network infrastructure.
Most Big Data implementations run on the Hadoop Data Platform and use a network of name nodes and data nodes to perform distributed data processing. The Hadoop Data Platform itself has several components for the storage and processing of Big Data, including: HDFS, the file storage system in Hadoop; Pig, a simplified programming language used to write MapReduce programs; Hive, a data warehouse application with its own SQL-like language called HiveQL; HBase, a database application; and HCatalog, an application used to create tables and databases. MapReduce is a data processing and sorting model developed by Google; its open-source implementation in Hadoop was driven largely by Yahoo.
Potential Implementations

• Designing efficient algorithms that detect and remove anomalies from data.
• Developing an instance of Hadoop that supports ‘just-in-time’ manufacturing.
• Developing MapReduce algorithms targeted towards manufacturing data (e.g. time-based data).
• General data mining.

Reference
1. http://hrboss.com/blog/2014-02-26/fundamentals-big-data-hr [Accessed 29 August 2014, 11:57 AM]