Introduction

One can argue that the most challenging task in a Big Data setting is obtaining the data
that will then be used for analysis and prediction. Toward this goal, in this assignment you
will set up a pipeline that ingests data from Twitter, cleans and processes it, and loads it into a Hive
table for analysis. You will use Apache Kafka and Apache Flume for data ingestion into HDFS,
Spark SQL for data analysis, and Spark ML for prediction.

2 Instructions

2.1 Step 1: Set up a Kafka producer to ingest tweets

Set up a Kafka producer in Python that gets data from Twitter for a specific set of keywords
related to a topic (the choice of topic and keywords is up to you) and sends it to a topic in a
Kafka broker. You will need to sign up for a developer account with Twitter, which is free. The
data should be formatted so that the other components of the pipeline can ingest it easily. There
is a limit on the number of calls that a producer can make to Twitter in a given time window.
Check the limits and adjust your code so that tweets are received continuously without exceeding
them. Some sample code for setting up the producer is provided, as well as online videos.
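
The pacing and formatting requirements in Step 1 can be sketched as follows. This is a minimal illustration, not the provided sample code: `fetch` and `send` are hypothetical callables standing in for your Twitter client call and your Kafka producer's send method, the chosen fields (`id`, `created_at`, `text`) are assumptions about what you want to keep, and the limit values must come from the rate limits that apply to your own developer account.

```python
import json
import time


def min_interval(max_calls, window_seconds):
    """Smallest delay between calls that stays within max_calls per window."""
    return window_seconds / max_calls


def to_message(tweet):
    """Serialize selected tweet fields as UTF-8 JSON bytes, one record per message."""
    record = {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text", ""),
    }
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def run_producer(fetch, send, max_calls, window_seconds, rounds,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch() `rounds` times, pacing the calls to respect the rate limit.

    fetch: hypothetical callable returning a list of tweet dicts.
    send:  hypothetical callable taking message bytes, for example a thin
           wrapper around a Kafka producer's send method for your topic.
    Returns the number of messages sent.
    """
    interval = min_interval(max_calls, window_seconds)
    sent = 0
    for _ in range(rounds):
        started = clock()
        for tweet in fetch():
            send(to_message(tweet))
            sent += 1
        # Sleep off whatever remains of this call's time slot.
        remaining = interval - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return sent
```

Emitting one JSON object per message keeps the records self-describing, which may make the later cleaning and loading steps simpler. Spacing calls evenly is only one way to stay under a limit; a client library's own rate-limit handling, if it has one, may be preferable.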

2.2 Step 2: Set up a Kafka consumer

Set up a Kafka consumer that reads from the Kafka topic and saves the data to HDFS. The
consumer should be designed to handle large volumes of data and should be fault-tolerant.
Some sample Kafka consumers are available as well.
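
One way to approach the volume and fault-tolerance requirements in Step 2 is to write records in batches and to commit Kafka offsets only after a batch has been written completely, so that a crash leads to re-reading messages rather than losing them. The sketch below illustrates that idea under stated assumptions: a local directory stands in for HDFS, `messages` is any iterable of JSON-encoded bytes, and `commit` is a hypothetical callable that would wrap your consumer's offset commit. The class and function names are illustrative.

```python
import json
import os
import tempfile


class BatchWriter:
    """Buffer records and write each full batch as one newline-delimited JSON file."""

    def __init__(self, out_dir, batch_size):
        self.out_dir = out_dir
        self.batch_size = batch_size
        self.buffer = []
        self.files_written = 0

    def add(self, record):
        """Buffer one record; return True if this call flushed a batch."""
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()
            return True
        return False

    def flush(self):
        """Write buffered records to a new file, then clear the buffer."""
        if not self.buffer:
            return
        final_path = os.path.join(self.out_dir, f"part-{self.files_written:05d}.json")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            for record in self.buffer:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
        # Rename only after a complete write so readers never see partial files.
        os.replace(tmp_path, final_path)
        self.files_written += 1
        self.buffer = []


def consume(messages, writer, commit):
    """Feed decoded messages to the writer; commit only after data is flushed."""
    for raw in messages:
        if writer.add(json.loads(raw)):
            commit()
    # Write any remaining partial batch before the final commit.
    writer.flush()
    commit()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        demo = [json.dumps({"id": i}).encode("utf-8") for i in range(5)]
        consume(demo, BatchWriter(out_dir, batch_size=2), commit=lambda: None)
        print(sorted(os.listdir(out_dir)))
```

Because offsets are committed after the write, a restart can produce duplicate records, so deduplicating on the tweet ID later in the pipeline may be necessary.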

2.3 Step 3: Set up a Flume agent

Apache Flume is a streaming tool typically used for text data. Compared with Apache Kafka, it is
more lightweight to install and set up. Review the videos posted on Apache Flume and set up a
Flume agent that gets data from Twitter and saves it to HDFS.

2.4 Step 4: Clean and process the data

The data saved to HDFS needs to be cleaned and split into multiple columns. Where you clean
the data is up to you: in the Kafka producer or consumer, in Flume, or at the end of the
pipeline. Ensure that the data is formatted so that it can be loaded easily into Spark for
later processing (see below).
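
As an illustration of the cleaning in Step 4, the sketch below strips URLs and @mentions from the tweet text and flattens each tweet into a fixed set of columns. Both the cleaning rules and the column names (`id`, `created_at`, `text`, `hashtags`) are assumptions chosen for the example; the right choices depend on your topic and on the queries you plan to run.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def to_columns(tweet):
    """Flatten one tweet dict into a fixed set of columns."""
    text = tweet.get("text", "")
    return {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "text": clean_text(text),
        # Hashtags are extracted from the raw text and lowercased.
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
    }
```

Writing one such flat record per line, for example as JSON, gives every row the same fields, which should make it straightforward to load the files into a DataFrame in the next step.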

2.5 Step 5: Load the data into Spark SQL

The data must then be loaded into a Scala DataFrame for analysis. Use the DataFrame to run
some queries on the data that you have read. The queries will depend on the topic that you
have chosen and the keywords received from Twitter.

2.6 Step 6: Train a Spark ML algorithm

Using the data in HDFS, train a machine learning algorithm with Spark ML to predict whether
the tweets that you have ingested have positive or negative sentiment. You can also choose
other predictions depending on the topic.
