1 Introduction
One can argue that the most challenging task in a Big Data setting is obtaining data that can then
be used for analysis and prediction. Toward this goal, in this assignment you will set up a
pipeline that ingests data from Twitter, cleans and processes it, and loads it into a Hive table
for analysis. You will use Apache Kafka and Apache Flume for data ingestion into HDFS, Spark SQL
for data analysis, and Spark ML for prediction.
2 Instructions
2.1 Step 1: Set up Kafka producer to ingest tweets
Set up a Kafka producer in Python that gets data from Twitter for a specific set of keywords
related to a topic (the choice of topic and keywords is up to you) and sends it to a topic in a
Kafka broker. You will need to sign up for a Twitter developer account, which is free. Format the
data so that the other components of the pipeline can ingest it easily. Twitter limits the number
of calls a producer can make in a given period. Check these limits and adjust your code so that
tweets are received continuously without exceeding them. Sample code for setting up the producer
is provided, along with online videos.
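The rate-limit and formatting requirements of Step 1 might be handled along the following lines.
This is a minimal sketch, not the provided sample code. SlidingWindowLimiter and to_kafka_message
are hypothetical names, the limit values are placeholders to be replaced with the ones Twitter
documents for your access level, and the field names id, created_at, and text are assumptions
about the tweet payload. Only the standard library is used; the Twitter client and the Kafka
producer calls are left out.

```python
import json
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `window_seconds` interval."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls            # placeholder: use Twitter's documented limit
        self.window_seconds = window_seconds  # placeholder: use Twitter's documented window
        self.clock = clock                    # injectable so the limiter can be tested
        self.calls = deque()                  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Return the seconds to wait before the next call is allowed (0.0 if none)."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        # The window is full: wait until the oldest call expires.
        return self.window_seconds - (now - self.calls[0])

    def record(self):
        """Register that a call was just made."""
        self.calls.append(self.clock())


def to_kafka_message(tweet):
    """Serialize selected tweet fields (assumed names) as UTF-8 JSON bytes."""
    record = {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text", ""),
    }
    return json.dumps(record, ensure_ascii=False).encode("utf-8")
```

In a producer loop, one would sleep for limiter.wait_time() seconds, make the Twitter request,
call limiter.record(), and then pass each returned tweet through to_kafka_message before handing
the bytes to the Kafka producer. Writing one JSON object per message keeps the records easy to
parse in the later steps.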
2.2 Step 2: Set up Kafka Consumer
Set up a Kafka consumer that reads from the Kafka topic and saves the data to HDFS. The consumer
should be designed to handle large volumes of data and to be fault-tolerant. Some sample Kafka
consumers are available as well.
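One way to approach the volume and fault-tolerance requirements of Step 2 is to write records in
batches and to treat an offset as safe to commit only after its batch has been written out. This
gives at-least-once behavior: after a crash, the consumer can resume from the last committed
offset instead of losing data. The sketch below illustrates the idea with the standard library
only. BatchWriter and demo are hypothetical names, a local temporary directory stands in for HDFS,
and the Kafka consumer calls themselves are omitted.

```python
import os
import tempfile


class BatchWriter:
    """Buffer (offset, line) pairs and write them out in fixed-size batches."""

    def __init__(self, out_dir, batch_size):
        self.out_dir = out_dir
        self.batch_size = batch_size
        self.buffer = []
        self.committed_offset = None  # last offset whose batch has been written

    def add(self, offset, line):
        """Buffer one record and flush when the batch is full."""
        self.buffer.append((offset, line))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write the buffered records to a new file, then advance the offset."""
        if not self.buffer:
            return
        first, last = self.buffer[0][0], self.buffer[-1][0]
        final_path = os.path.join(self.out_dir, f"tweets-{first}-{last}.jsonl")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            for _, line in self.buffer:
                f.write(line + "\n")
        # Rename only after the write succeeded, so readers never see partial files.
        os.replace(tmp_path, final_path)
        self.committed_offset = last
        self.buffer = []


def demo():
    """Write three records with a batch size of two and report the result."""
    out_dir = tempfile.mkdtemp()
    writer = BatchWriter(out_dir, batch_size=2)
    for offset, line in enumerate(['{"id": 1}', '{"id": 2}', '{"id": 3}']):
        writer.add(offset, line)
    writer.flush()  # write the final partial batch
    return sorted(os.listdir(out_dir)), writer.committed_offset
```

In a real consumer, committed_offset would be passed to the Kafka client's commit call after each
flush, with automatic offset commits disabled, and the file writes would target HDFS rather than
the local filesystem.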
2.3 Step 3: Set up Flume Agent
Apache Flume is a streaming tool typically used for text data. Compared with Apache Kafka, it is
more lightweight to install and set up. Review the videos posted on Apache Flume and set up a Flume
agent that gets data from Twitter and saves it to HDFS.
2.4 Step 4: Clean and Process Data
The data saved to HDFS needs to be cleaned and split into multiple columns. Where you clean the
data is up to you: in the Kafka producer or consumer, in Flume, or at the end of the pipeline.
Ensure that the data is formatted so that it can be easily loaded into Spark for later
processing (see below).
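As one illustration of Step 4, the sketch below turns JSON-lines tweet records into CSV rows with
separate columns. The cleaning rules (dropping URLs and @mentions, keeping hashtag words) and the
column names id, created_at, text, and hashtags are assumptions chosen for illustration, not
requirements of the assignment, and the function names are hypothetical.

```python
import csv
import io
import json
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove URLs and @mentions, strip '#' from hashtags, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)  # keep the tag word, drop the '#'
    return WHITESPACE_RE.sub(" ", text).strip()


def to_row(json_line):
    """Convert one JSON tweet record into a list of column values."""
    tweet = json.loads(json_line)
    raw = tweet.get("text", "")
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(raw)]
    return [
        str(tweet.get("id", "")),
        tweet.get("created_at", ""),
        clean_text(raw),
        ";".join(hashtags),
    ]


def to_csv(json_lines):
    """Render the records as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "created_at", "text", "hashtags"])
    for line in json_lines:
        writer.writerow(to_row(line))
    return out.getvalue()
```

Using the csv module rather than joining strings by hand means that commas, quotes, and newlines
inside tweet text are quoted properly, which keeps the column boundaries intact when the file is
read back.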
2.5 Step 5: Load Data into Spark SQL
The data must then be loaded into a Spark DataFrame in Scala for analysis. Use the DataFrame to run
some queries on the data you have read. The queries will depend on the topic you have chosen and
the keywords received from Twitter.
2.6 Step 6: Train a Spark ML algorithm
Using the data in HDFS, train a machine learning algorithm with Spark ML to predict whether the
tweets you have ingested have positive or negative sentiment. You can also choose other predictions
depending on the topic.
