In this Hands-On, we are going to show how you can use Apache Spark and some components of its ecosystem for data processing. This workshop is split into four parts. We will use a dataset of tweets that contains just a few fields: id, user, text, country and place.
In the first part, you will play with the Spark API for basic operations such as counting, filtering and aggregating.
After that, you will get to know Spark SQL to query structured data (here in JSON) using SQL.
In the third part, you will use Spark Streaming and the Twitter streaming API to analyse a live stream of tweets.
To finish, we will build a simple model to identify the language of a text. For that, you will use MLlib.
Let's go and have fun!
Java 7 or later (Java 8 is recommended, so that you can use lambdas)
Apache Spark https://spark.apache.org
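To give a feel for why Java 8 lambdas are recommended, here is a minimal sketch of the kind of counting, filtering and aggregating mentioned for the first part. It deliberately uses plain Java 8 streams rather than the Spark API, so it runs without any Spark dependency. The class name TweetBasics, the Tweet holder class, the helper methods and the sample data are all made up for illustration; only the field names (id, user, text, country, place) come from the dataset description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class TweetBasics {

    // Hypothetical holder with the five fields named in the introduction.
    static final class Tweet {
        final long id;
        final String user;
        final String text;
        final String country;
        final String place;

        Tweet(long id, String user, String text, String country, String place) {
            this.id = id;
            this.user = user;
            this.text = text;
            this.country = country;
            this.place = place;
        }
    }

    // Made-up sample data, standing in for the workshop dataset.
    static List<Tweet> sampleTweets() {
        return Arrays.asList(
            new Tweet(1L, "alice", "Hello Spark", "France", "Paris"),
            new Tweet(2L, "bob", "Streaming is fun", "France", "Lyon"),
            new Tweet(3L, "alice", "SQL on JSON", "Germany", "Berlin"),
            new Tweet(4L, "carol", "Learning MLlib", "Spain", "Madrid"));
    }

    // Filtering + counting: how many tweets come from the given country.
    static long countFromCountry(List<Tweet> tweets, String country) {
        return tweets.stream()
                     .filter(t -> country.equals(t.country))
                     .count();
    }

    // Aggregating: number of tweets per user, sorted by user name.
    static Map<String, Long> tweetsPerUser(List<Tweet> tweets) {
        return tweets.stream()
                     .collect(Collectors.groupingBy(
                         t -> t.user, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Tweet> tweets = sampleTweets();
        System.out.println("Total tweets: " + tweets.size());
        System.out.println("Tweets from France: " + countFromCountry(tweets, "France"));
        System.out.println("Tweets per user: " + tweetsPerUser(tweets));
    }
}
```

Without lambdas, each of the filter and grouping functions above would have to be written as an anonymous inner class. The same applies to the function arguments you pass to Spark operations, which is why Java 8 makes the exercises shorter to write.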