Harvesting Twitter Data Using Big Data Techniques
Sridhar Mamella
MSc – Big Data and Business Intelligence | 2014 - 2015
Introduction
This project aims to build the analytical skills needed to
study big data and to provide a solid foundation for developing
the solutions and applications that manipulate it.
Results
Word Cloud depicting the most used English Tweet words
Heat Maps illustrating the most active tweeting regions
Conclusion
•  Evaluated the common statistical analysis and machine
learning techniques used to manipulate data
•  Utilised current big data technologies
•  Selected and employed an appropriate tool
Technologies
•  R
•  Hadoop
•  Hive
•  Pig
The Hadoop project contains four core modules: Hadoop Common,
the Hadoop Distributed File System (HDFS), Hadoop YARN and
Hadoop MapReduce.
Together these modules give users the tools to support
additional Hadoop projects, along with the ability to process
large data sets in parallel across a cluster (as batch jobs
rather than in real time) while YARN automatically schedules
jobs and manages cluster resources.
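
As a rough illustration of the MapReduce model named above, the sketch below counts words across a few tweets in plain Python. It runs on one machine and does not use Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample tweets are hypothetical and are not taken from the project.

```python
from collections import defaultdict


def map_phase(tweet):
    """Mapper: emit a (word, 1) pair for every word in one tweet."""
    return [(word.lower(), 1) for word in tweet.split()]


def shuffle(pairs):
    """Group emitted values by key, mimicking the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(tweets):
    """Run map, shuffle and reduce in sequence over a list of tweets."""
    pairs = [pair for tweet in tweets for pair in map_phase(tweet)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


# Hypothetical sample input, for illustration only.
sample_tweets = ["big data is big", "Hadoop handles big data"]
counts = word_count(sample_tweets)
print(counts)
```

On a real cluster the mappers and reducers would run on different nodes and the framework would perform the grouping step; word counts of this kind are also the usual input to a word cloud.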
To complement the Hadoop modules there are also a variety
of other projects that provide specialized services.
•  Apache Hive
•  Apache Spark
•  Apache Ambari
•  Apache Pig
•  Apache HBase
Future Work
•  Work with more complex datasets
•  Use Apache Sqoop
•  Implement RHadoop
Hadoop Ecosystem
Analytics
•  Working with R
•  Building Matrices
•  K-means algorithm
•  Creating Word Clouds
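
The matrix-building and K-means steps listed above can be sketched as follows. The project itself used R; this is a minimal pure-Python sketch under stated assumptions, not the author's implementation. The helper names (`build_matrix`, `squared_distance`, `kmeans`), the sample documents and the choice to seed the centroids with the first k rows are all hypothetical.

```python
def build_matrix(docs):
    """Build a document-term matrix: one row per document, one column per word."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    matrix = [[doc.lower().split().count(word) for word in vocab] for doc in docs]
    return vocab, matrix


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain K-means; centroids are seeded with the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical sample documents, for illustration only.
docs = [
    "hadoop hive pig",
    "football match goal",
    "hive pig hadoop hdfs",
    "goal football team",
]
vocab, matrix = build_matrix(docs)
labels, centroids = kmeans(matrix, k=2)
print(labels)
```

Seeding with the first k rows keeps the sketch deterministic; practical implementations usually pick random or k-means++ starting points and run several restarts.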
Hadoop
•  HDFS
•  Hive – HQL
•  Pig – Pig Latin
