Big Data Analytics in Manufacturing
Faizan Cassim and Carl Berry
Research Poster

Learn About Big Data and Hadoop

The Most Significant Resource
1. http://hrboss.com/blog/2014-02-26/fundamentals-big-data-hr [Accessed 29 August 2014, 11:57 AM]

Scope of the Research Internship
A C# application was developed to generate a large number of emails, each with a random sender, recipient, topic and email body, plus a keyword embedded within the email and a punctuation mark. A number of emails were combined into a single CSV file, and several such CSV files were produced. Experimenting with Big Data did not require the enormous data files that would be used in real life.
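The generator's behaviour can be sketched as follows (a Python stand-in for the original C# tool; the field values and the output file name are illustrative assumptions, not the internship's actual code):

```python
import csv
import random

# Illustrative sketch of the email generator described above. The real tool
# was written in C#; these senders, topics and keywords are assumptions.
SENDERS = ["alice@example.com", "bob@example.com", "carol@example.com"]
TOPICS = ["invoice", "meeting", "shipment", "maintenance"]
KEYWORDS = ["urgent", "defect", "schedule", "quota"]
PUNCTUATION = ["!", "?", ".", ";"]

def random_email(rng):
    """Build one record with a random sender, recipient, topic,
    body (containing a keyword) and punctuation mark."""
    sender = rng.choice(SENDERS)
    recipient = rng.choice([s for s in SENDERS if s != sender])
    topic = rng.choice(TOPICS)
    keyword = rng.choice(KEYWORDS)
    punct = rng.choice(PUNCTUATION)
    body = f"Regarding the {topic}, please note this is {keyword}{punct}"
    return [sender, recipient, topic, body, keyword, punct]

def write_batch(path, n, seed=0):
    """Combine n generated emails into a single CSV file."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sender", "recipient", "topic",
                         "body", "keyword", "punctuation"])
        for _ in range(n):
            writer.writerow(random_email(rng))

write_batch("emails_0.csv", 1000)
```

Running `write_batch` once per output file yields the several CSV batches described above; the data is small enough to experiment with, yet shaped like a real email corpus.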
The second part of the internship was about understanding how Big Data works. The Hortonworks Hadoop platform was used, since it was the core framework of Microsoft's Big Data implementation in Azure (HDInsight). After completing Hortonworks' official suite of tutorials, the user is able to complete some basic tasks, such as importing data, running queries and scripting iterative code using Pig. In addition, the user is able to use some of the functions of Hadoop's command-line interface.
The third part of the internship looked into the application aspects of Big Data. A set of community tutorials is available to complement the official suite; these look at more advanced topics, covering aspects such as reading data from various sources, including social networks. The Pig and HiveQL scripting languages, and the MapReduce model, were covered in detail. The final part of the internship was about analyzing and designing MapReduce programs. MapReduce implementations in Java, C# and Pig were looked into.
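The MapReduce pattern studied in this final phase can be sketched in plain Python (no Hadoop cluster required); this word count is an illustrative stand-in for the Java, C# and Pig implementations that were examined:

```python
from collections import defaultdict

# Minimal single-process sketch of the MapReduce pattern: a mapper emits
# (key, value) pairs, a shuffle step groups them by key, and a reducer
# folds each group into a result. A real Hadoop job runs these phases
# in parallel across the cluster's data nodes.
def mapper(line):
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return (key, sum(values))

def map_reduce(lines):
    pairs = (p for line in lines for p in mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())

counts = map_reduce(["big data big deal", "data nodes store data"])
# counts["data"] == 3, counts["big"] == 2
```

Swapping in a different `mapper`/`reducer` pair changes the analysis while the framework's grouping machinery stays the same, which is what makes the model attractive for cluster execution.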
Conclusion
HiveQL and Pig are limited in their functionality compared to a traditional MapReduce program.
MapReduce programs can be written in C#, provided that the user runs Microsoft's instance of Hadoop (HDInsight).
Third-party software such as Spring XD can be used to import social-network data into Hadoop.
Specialist hardware or software is not needed to run a Hadoop implementation.
A Hortonworks (or HDInsight) user can export data into Microsoft Excel 2013 for further analysis.
Some instances of Hadoop provide visualization tools that help the user make better sense of the data; where these are absent, the user can opt for third-party tools, in any instance of Hadoop, to the same end.
What is Big Data?
According to Gartner's 3V model¹, Big Data can be defined as having Volume, Velocity and/or Variety. Volume represents the large scale of the data (typically ranging into terabytes); velocity represents the rate at which data is produced (the computer networks at CERN process 10 GB of data every second); variety represents the diversity of the data, which can often be an image, a video or a piece of text. Social networks are famous for variety, as data about people often comes in the form of images and videos as well as text. Another important implication of Big Data is the ability to handle semi-structured or unstructured data, which makes traditional RDBMS methods inefficient for the job. Big Data can run on commodity hardware with any network infrastructure.
Most Big Data implementations run on the Hadoop Data Platform and use a network infrastructure consisting of name nodes and data nodes to perform the task of data processing. The Hadoop Data Platform itself has several components for the mining and processing of Big Data. These include: HDFS, the file storage system in Hadoop; Pig, a simplified programming language used to write MapReduce programs; Hive, a data-warehouse application used in Hadoop, with its own SQL-like language called HiveQL; HBase, a database application; and HCatalog, an application used to create tables and databases. MapReduce is a data-processing model that was developed by Google and implemented in Hadoop largely by Yahoo.
Potential Implementations
Designing efficient algorithms that detect and remove anomalies from data.
Developing an instance of Hadoop that supports ‘just-in-time’ manufacturing.
Developing MapReduce algorithms targeted at manufacturing data (e.g. time-based data).
General data mining.
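The first idea above, anomaly removal, can be sketched as a simple z-score filter in Python (a deliberately naive stand-in for the efficient, cluster-scale algorithms the poster proposes; the sensor readings and threshold are assumptions):

```python
import math

# Naive z-score anomaly filter: drop readings more than `threshold`
# standard deviations from the mean. With small samples a single huge
# outlier inflates the standard deviation, so a low threshold is used;
# a production algorithm would need a streaming, MapReduce-friendly form.
def remove_anomalies(values, threshold=2.0):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / std <= threshold]

# Hypothetical time-based sensor readings with one obvious spike.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 500.0, 20.2]
clean = remove_anomalies(readings)
```

Here the spike at 500.0 is filtered out while the remaining readings pass through in order; the mean and variance steps map naturally onto a MapReduce aggregation.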