Introduction One can argue that the most challenging task in.pdf


1 Introduction

One can argue that the most challenging task in a Big Data setting is getting the data that can then be used for data analysis and predictions. Towards this goal, in this assignment you will set up a pipeline to ingest data from Twitter, clean and process it, and load it into a Hive table for analysis. You will use Apache Kafka and Apache Flume for data ingestion into HDFS, Spark SQL for data analysis, and Spark ML for prediction.

2 Instructions

2.1 Step 1: Set up a Kafka producer to ingest tweets

Set up a Kafka producer in Python that gets data from Twitter for a specific set of keywords related to a topic (the choice of topic and keywords is up to you) and sends it to a topic in a Kafka broker. You will need to sign up for a developer account with Twitter, which is free. The data should be formatted so that the other components of the pipeline can ingest it easily. There is a limit on the number of calls that a producer can make to Twitter in a given time window. Check the limits and adjust your code so that tweets are received continuously without exceeding them. Some sample code for setting up the producer is provided, as well as online videos.

2.2 Step 2: Set up a Kafka consumer

Set up a Kafka consumer that reads from the Kafka topic and saves the data to HDFS. The consumer should be designed to handle large volumes of data and should be fault-tolerant. Some sample Kafka consumers are available as well.

2.3 Step 3: Set up a Flume agent

Apache Flume is a streaming tool typically used for text data. Compared with Apache Kafka, it is more lightweight to install and set up. Review the videos posted on Apache Flume and set up a Flume agent that gets data from Twitter and saves it to HDFS.

2.4 Step 4: Clean and process data

The data that is saved to HDFS needs to be cleaned and split into multiple columns. Where you clean the data is up to you: in the Kafka producer, in the Kafka consumer, in Flume, or at the end of the pipeline. You should ensure that the data is formatted in a way that can be easily loaded into Spark for later processing (see below).

2.5 Step 5: Load data into Spark SQL

The data must then be loaded into a Spark DataFrame in Scala for analysis. Use the DataFrame to run some queries on the data that you have read. The queries will depend on the topic that you have chosen and the keywords received from Twitter.

2.6 Step 6: Train a Spark ML algorithm

Using the data in HDFS, train a machine learning algorithm with Spark ML to predict whether the tweets that you have ingested have positive or negative sentiment. You can also choose other predictions depending on the topic.
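As a rough illustration of Steps 1 and 4, the sketch below paces calls to a tweet source and flattens each tweet into one JSON line with fixed columns. It is a minimal sketch under stated assumptions, not the provided sample code: the functions `fetch` and `send` are hypothetical callables that you supply, the column names (`id`, `created_at`, `lang`, `text`) are an assumed subset of the tweet payload, and `min_interval` must be derived from the rate limits that apply to your developer account.

```python
import json
import re
import time
from typing import Callable, Dict, Iterable

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Drop URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def to_record(tweet: Dict) -> str:
    """Flatten one raw tweet into a single-line JSON record.

    The four columns are an assumption made for this example; pick the
    fields that your own queries and model need.
    """
    record = {
        "id": str(tweet.get("id", "")),
        "created_at": tweet.get("created_at", ""),
        "lang": tweet.get("lang", ""),
        "text": clean_text(tweet.get("text", "")),
    }
    return json.dumps(record, ensure_ascii=False)


def run_producer(
    fetch: Callable[[], Iterable[Dict]],
    send: Callable[[str], None],
    min_interval: float,
    max_calls: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() at most once per min_interval seconds.

    Every tweet returned by fetch() is flattened and handed to send().
    Returns the number of records sent.
    """
    sent = 0
    for call in range(max_calls):
        started = time.monotonic()
        for tweet in fetch():
            send(to_record(tweet))
            sent += 1
        # Wait out the rest of the interval, except after the final call.
        remaining = min_interval - (time.monotonic() - started)
        if remaining > 0 and call < max_calls - 1:
            sleep(remaining)
    return sent
```

In a real pipeline, `fetch` would wrap one request to the Twitter API and `send` would publish to Kafka; with the kafka-python client, for instance, `send` could be `lambda line: producer.send("tweets", line.encode("utf-8"))`, where the topic name `"tweets"` is a placeholder. Writing one JSON object per line is a deliberate choice here because Spark can read line-delimited JSON directly into a DataFrame.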



