1 Introduction

One can argue that the most challenging task in a Big Data setting is obtaining data that can
then be used for analysis and prediction. Toward this goal, in this assignment you will set up a
pipeline that ingests data from Twitter, cleans and processes it, and loads it into a Hive table
for analysis. You will use Apache Kafka and Apache Flume for data ingestion into HDFS, Spark SQL
for data analysis, and Spark ML for prediction.

2 Instructions

2.1 Step 1: Set up a Kafka producer to ingest tweets

Set up a Kafka producer in Python that gets data from Twitter for a specific set of keywords
related to a topic (the choice of topic and keywords is up to you) and sends it to a topic in a
Kafka broker. You will need to sign up for a developer account with Twitter, which is free. The
data should be formatted so that the other components of the pipeline can ingest it easily.
Twitter limits the number of calls that a producer can make in a given period. Check the limits
and adjust your code so that tweets are received continuously without exceeding them. Some sample
code for setting up the producer is provided, as well as online videos.
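
The rate-limit requirement in Step 1 can be sketched as follows. This is a minimal illustration and not the provided sample code: `fetch_batch` and `send` are hypothetical callables supplied by you, the limit values are placeholders to be replaced with the limits documented for your Twitter access level, and the field names `id`, `created_at`, and `text` are assumptions about the tweet payload. In a real producer, `send` could for example wrap a Kafka client's send call for your topic.

```python
import json
import time
from collections import deque
from typing import Callable, Deque, Dict, Iterable


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls within any `period` seconds."""

    def __init__(self, max_calls: int, period: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls: Deque[float] = deque()

    def wait(self) -> None:
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)


def tweet_to_message(tweet: Dict) -> bytes:
    """Keep a few fields (assumed names) and encode them as UTF-8 JSON."""
    record = {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text", ""),
    }
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def run_producer(fetch_batch: Callable[[], Iterable[Dict]],
                 send: Callable[[bytes], None],
                 limiter: SlidingWindowLimiter,
                 rounds: int) -> int:
    """Poll `fetch_batch` `rounds` times and pass each tweet to `send`."""
    sent = 0
    for _ in range(rounds):
        limiter.wait()  # stay under the API call limit before each request
        for tweet in fetch_batch():
            send(tweet_to_message(tweet))
            sent += 1
    return sent
```

Emitting one JSON object per message keeps the format easy to parse in the later steps. The clock and sleep functions are injectable only so that the limiter can be exercised without real waiting.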

2.2 Step 2: Set up a Kafka consumer

Set up a Kafka consumer that reads from the Kafka topic and saves the data to HDFS. The consumer
should be designed to handle large volumes of data and should be fault-tolerant. Some sample
Kafka consumers are also available.
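
One common way to approach the volume and fault-tolerance requirements in Step 2 is to write records in batches and to commit the consumer's position only after a batch has been written successfully, so that a failed write is retried rather than lost. The sketch below shows only that control flow. `write_batch` and `commit` are hypothetical callables: the first would write to HDFS with whatever client you choose, and the second would commit the offset with your Kafka consumer.

```python
from typing import Callable, List, Optional


class BatchingSink:
    """Buffer records and commit offsets only after a successful write."""

    def __init__(self,
                 write_batch: Callable[[List[bytes]], None],
                 commit: Callable[[int], None],
                 batch_size: int) -> None:
        self.write_batch = write_batch  # hypothetical: persists one batch
        self.commit = commit            # hypothetical: commits an offset
        self.batch_size = batch_size
        self.buffer: List[bytes] = []
        self.last_offset: Optional[int] = None

    def add(self, offset: int, value: bytes) -> None:
        """Buffer one record and flush when the batch is full."""
        self.buffer.append(value)
        self.last_offset = offset
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write the buffered records, then commit the last offset."""
        if not self.buffer:
            return
        # If the write raises, the buffer is kept and nothing is committed,
        # so the same records can be retried or re-read after a restart.
        self.write_batch(list(self.buffer))
        self.commit(self.last_offset)
        self.buffer = []
```

Because the commit follows the write, a crash between the two can cause a batch to be written twice after a restart. This at-least-once behavior means duplicates may need to be removed during cleaning, for example by tweet id.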

2.3 Step 3: Set up a Flume agent

Apache Flume is a streaming tool typically used for text data. Compared with Apache Kafka, it is
more lightweight to install and set up. Review the videos posted on Apache Flume and set up a
Flume agent that gets data from Twitter and saves it to HDFS.

2.4 Step 4: Clean and process the data

The data saved to HDFS needs to be cleaned and split into multiple columns. Where you clean the
data is up to you: in the Kafka producer or consumer, in Flume, or at the end of the pipeline.
Ensure that the data is formatted so that it can be easily loaded into Spark for later
processing (see below).
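
As an illustration of Step 4, the sketch below turns one raw JSON tweet into a fixed set of columns and a tab-separated line. The column set, the field names `id`, `created_at`, and `text`, and the cleaning rules (dropping URLs and mentions, collecting hashtags) are all assumptions chosen for the example. Adapt them to your topic and to the format your producer actually emits.

```python
import json
import re
from typing import Dict

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")

# Assumed column order for the cleaned output.
COLUMNS = ["id", "created_at", "text", "hashtags"]


def clean_tweet(raw_json: str) -> Dict[str, str]:
    """Parse one JSON tweet and return a cleaned, column-oriented record."""
    tweet = json.loads(raw_json)
    text = tweet.get("text", "")
    # Collect hashtags before stripping the '#' signs from the text.
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(text)]
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")
    # Collapsing whitespace also removes tabs and newlines from the text,
    # which keeps each record on a single tab-separated line.
    text = WHITESPACE_RE.sub(" ", text).strip()
    return {
        "id": str(tweet.get("id", "")),
        "created_at": str(tweet.get("created_at", "")),
        "text": text,
        "hashtags": "|".join(hashtags),
    }


def to_tsv_line(row: Dict[str, str]) -> str:
    """Join the columns in a fixed order, separated by tabs."""
    return "\t".join(row[column] for column in COLUMNS)
```

A delimited text file with a fixed column order is one format that Spark can read directly. Writing the cleaned records as JSON lines would work as well.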

2.5 Step 5: Load the data into Spark SQL

The data must then be loaded into a Spark DataFrame in Scala for analysis. Use the DataFrame to
run some queries on the data that you have read. The queries will depend on the topic you have
chosen and the keywords received from Twitter.

2.6 Step 6: Train a Spark ML algorithm

Using the data in HDFS, train a machine learning algorithm with Spark ML to predict whether the
tweets that you have ingested have positive or negative sentiment. You can also choose other
prediction targets, depending on the topic.
