#CASHTAG
BIG DATA PIPELINE FOR USER SENTIMENT ANALYSIS
Shafi Bashar
MOTIVATION
• People have opinions
• Different sources, different mediums - Twitter, Reddit, Facebook, etc.
• Platform for aggregating opinions on a topic and analyzing them
• v1.0: users' opinions of the US stock market
DEMO
• Webpage
http://www.hashtagcashtag.com
• Video
http://youtu.be/7oMrJ7n1Hr4
• Alternate Link
http://54.67.108.50
PIPELINE λ ARCHITECTURE
[Diagram: Data Ingestion, HDFS, Batch Layer (Batch View), Speed Layer (Real-time View), Serving Layer, Front End]
DATA INGESTION
• Two sources
1. Twitter Data
2. Stock Data
• Twitter Data from streaming API
{u'contributors': None,
u'coordinates': None,
u'created_at': u'Mon Feb 02 07:41:06 +0000 2015',
u'entities': {u'hashtags': [],
u'symbols': [{u'indices': [0, 3], u'text': u'FB'}],
u'urls': [{u'display_url': u'stocknewswires.com/2015/02/fb-goo\u2026',
u'expanded_url': u'http://www.stocknewswires.com/2015/02/fb-google-glass-might-be-googles-most-successful-failure-yet.html',
u'indices': [67, 89],
u'url': u'http://t.co/6iY3WYz82M'}],
u'user_mentions': []},
u'favorite_count': 0,
u'favorited': False,
u'geo': None,
u'id': 562153724219764737,
u'id_str': u'562153724219764737',
u'in_reply_to_screen_name': None,
u'in_reply_to_status_id': None,
u'in_reply_to_status_id_str': None,
u'in_reply_to_user_id': None,
u'in_reply_to_user_id_str': None,
u'lang': u'en',
u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'},
u'place': None,
u'possibly_sensitive': False,
u'retweet_count': 0,
u'retweeted': False,
u'source': u'<a href="http://fnsappbg.blogspot.com/" rel="nofollow">FNS_APP</a>',
u'text': u"$FB:\n\nGoogle Glass Might Be Google's Most Successful Failure Yet:\n\nhttp://t.co/6iY3WYz82M",
u'truncated': False,
u'user': {u'contributors_enabled': False,
u'created_at': u'Mon Nov 17 20:15:38 +0000 2014',
u'default_profile': True,
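The ticker symbols a tweet mentions sit under `entities.symbols`, and the posting time under `created_at`. A minimal Python sketch of pulling both out of a tweet dict follows; the function names and the trimmed-down `sample_tweet` are illustrative assumptions, not the pipeline's actual code.

```python
from datetime import datetime
from typing import Any, Dict, List


def extract_cashtags(tweet: Dict[str, Any]) -> List[str]:
    """Return the upper-cased ticker symbols listed under entities.symbols."""
    entities = tweet.get("entities") or {}
    symbols = entities.get("symbols") or []
    return [s["text"].upper() for s in symbols if s.get("text")]


def parse_created_at(tweet: Dict[str, Any]) -> datetime:
    """Parse the created_at string, e.g. 'Mon Feb 02 07:41:06 +0000 2015'."""
    return datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")


# Hypothetical, trimmed-down copy of the sample tweet shown above.
sample_tweet = {
    "created_at": "Mon Feb 02 07:41:06 +0000 2015",
    "entities": {"hashtags": [], "symbols": [{"indices": [0, 3], "text": "FB"}]},
    "lang": "en",
}

print(extract_cashtags(sample_tweet), parse_created_at(sample_tweet).isoformat())
```

Tweets without a `symbols` entry yield an empty list, so they can be filtered out before any aggregation.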
DATA INGESTION
• Stock Data from www.netfonds.no
• One incremental CSV file per stock
• Preprocessing to add ticker and timestamp
• Multi-topic, multi-consumer Kafka
20150126T153000 113.67 100 Auto trade
20150126T153000 113.65 161 Auto trade
20150126T153000 113.68 270 Auto trade
20150126T153000 113.67 100 Auto trade
20150126T153001 113.66 100 Auto trade
20150126T153001 113.65 100 Auto trade
20150126T153001 113.67 100 Auto trade
1422981307.82,AMGN,2015-02-03 17:17:05,149.76,300,,Auto trade,,,
1422981307.82,AMGN,2015-02-03 17:17:53,149.62,207,,Auto trade,,,
1422981307.83,AMGN,2015-02-03 17:16:08,149.81,100,,Auto trade,,,
1422981307.83,AMGN,2015-02-03 17:16:11,149.77,100,,Auto trade,,,
1422981308.26,LNKD,2015-02-03 17:15:28,228.2,200,,Auto trade,,,
1422981308.27,LNKD,2015-02-03 17:16:45,228.5,100,,Auto trade,,,
1422981308.27,LNKD,2015-02-03 17:19:02,229.27,100,,Auto trade,,,
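The preprocessing step can be sketched in Python as below. This is an illustration under stated assumptions, not the original code: it assumes the raw file is tab-separated with eight columns (empty ones are invisible in the sample above), that the first output column is the time the record was ingested, and that the trade time is reformatted as shown in the second sample. The name `preprocess_trade` and the raw line in the example are hypothetical.

```python
from datetime import datetime


def preprocess_trade(raw_line: str, ticker: str, ingested_at: float) -> str:
    """Prefix a raw trade line with an ingestion timestamp and the ticker."""
    # Strip only the newline so trailing empty columns are preserved.
    fields = raw_line.rstrip("\n").split("\t")
    trade_time = datetime.strptime(fields[0], "%Y%m%dT%H%M%S")
    out = [
        f"{ingested_at:.2f}",                        # assumed: epoch seconds at ingestion
        ticker,
        trade_time.strftime("%Y-%m-%d %H:%M:%S"),
    ] + fields[1:]                                   # price, quantity, remaining columns
    return ",".join(out)


# Hypothetical raw line laid out like the first sample (empty columns kept).
raw = "20150203T171705\t149.76\t300\t\tAuto trade\t\t\t"
print(preprocess_trade(raw, "AMGN", 1422981307.82))
```

Carrying the ticker inside each record lets trades for many stocks share a Kafka topic without losing track of which stock a line belongs to.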
BATCH LAYER
• Spark batch job (written in Scala)
• Twitter
• Number of mentions and sentiment of the mentions per time granularity
• Top trending stocks
ticker | year | month | day | hour | minute | frequency | sentiment
--------+------+-------+-----+------+--------+-----------+-----------
TSLA | 2015 | 2 | 3 | 16 | 35 | 16 | 0
TSLA | 2015 | 2 | 3 | 16 | 34 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 33 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 32 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 31 | 9 | 3
TSLA | 2015 | 2 | 3 | 16 | 30 | 4 | 0
year | month | day | hour | frequency | sentiment | ticker
------+-------+-----+------+-----------+-----------+--------
2015 | 1 | 31 | 13 | 33 | -3 | AAPL
2015 | 1 | 31 | 13 | 17 | 3 | TWTR
2015 | 1 | 31 | 13 | 16 | 2 | SBUX
2015 | 1 | 31 | 13 | 15 | 1 | KO
2015 | 1 | 31 | 13 | 14 | 3 | MCD
2015 | 1 | 31 | 13 | 13 | 0 | EBAY
2015 | 1 | 31 | 13 | 12 | -1 | MSFT
2015 | 1 | 31 | 13 | 11 | 0 | XOM
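The two tables above can be reproduced in miniature with plain Python. The slides say the real job is a Spark batch job in Scala; the sketch below only illustrates the grouping logic, and the function names, the `Mention` tuple layout, and the sample data are assumptions.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

Mention = Tuple[str, datetime, int]  # (ticker, tweet time, sentiment score)


def aggregate_mentions(mentions: Iterable[Mention],
                       by_minute: bool = True) -> Dict[tuple, Tuple[int, int]]:
    """Count mentions and sum sentiment per ticker and time bucket."""
    totals = defaultdict(lambda: [0, 0])
    for ticker, ts, score in mentions:
        key = (ticker, ts.year, ts.month, ts.day, ts.hour)
        if by_minute:
            key += (ts.minute,)
        totals[key][0] += 1        # frequency
        totals[key][1] += score    # sentiment
    return {key: (freq, sent) for key, (freq, sent) in totals.items()}


def top_trending(hourly: Dict[tuple, Tuple[int, int]],
                 hour_key: Tuple[int, int, int, int],
                 limit: int = 10) -> List[Tuple[str, int, int]]:
    """Rank tickers within one (year, month, day, hour) by mention count."""
    rows = [(key[0], freq, sent)
            for key, (freq, sent) in hourly.items() if key[1:] == hour_key]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:limit]


# Made-up mentions for illustration only.
mentions = [
    ("TSLA", datetime(2015, 2, 3, 16, 31, 5), 1),
    ("TSLA", datetime(2015, 2, 3, 16, 31, 40), 2),
    ("TSLA", datetime(2015, 2, 3, 16, 30, 12), 0),
    ("AAPL", datetime(2015, 2, 3, 16, 30, 59), -1),
]
per_minute = aggregate_mentions(mentions)
per_hour = aggregate_mentions(mentions, by_minute=False)
print(top_trending(per_hour, (2015, 2, 3, 16)))
```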
SENTIMENT ANALYSIS
• Negative: "downgraded", "bears", "bear", "bearish", "volatile", "short", "sell", "selling", "forget", "down", "resistance", "sold", …
• Positive: "upgrade", "upgraded", "long", "buy", "buying", "growth", "good", "gained", "well", "great", "nice", "top", …
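One plausible way to turn these word lists into the integer sentiment values seen in the tables is to add one for each positive word and subtract one for each negative word. The slides do not state the exact scoring rule, so the Python sketch below is an assumption; it uses only the words shown on the slide (both lists are elided there), and the name `sentiment_score` is illustrative.

```python
import re

# Only the words visible on the slide; the full lists are longer.
NEGATIVE = {"downgraded", "bears", "bear", "bearish", "volatile", "short",
            "sell", "selling", "forget", "down", "resistance", "sold"}
POSITIVE = {"upgrade", "upgraded", "long", "buy", "buying", "growth",
            "good", "gained", "well", "great", "nice", "top"}


def sentiment_score(text: str) -> int:
    """Return (#positive words) - (#negative words) found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


print(sentiment_score("Upgraded to buy, great growth"))
print(sentiment_score("$TSLA bearish, selling short"))
```

A word-count score like this ignores negation and context ("not good" still counts as positive), which is a known limitation of lexicon-based approaches.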
BATCH LAYER
• Stocks
• high, low, open, close, volume
• Azkaban controls the flow and scheduling
• Batch layer uses a recomputation algorithm
ticker | year | month | day | hour | minute | close | high | low | open | volume
--------+------+-------+-----+------+--------+--------+--------+--------+--------+--------
TWTR | 2015 | 1 | 30 | 13 | 0 | 37.55 | 37.55 | 37.55 | 37.55 | 6740
TWTR | 2015 | 1 | 30 | 12 | 59 | 37.54 | 37.56 | 37.46 | 37.47 | 96070
TWTR | 2015 | 1 | 30 | 12 | 58 | 37.47 | 37.51 | 37.46 | 37.51 | 47839
TWTR | 2015 | 1 | 30 | 12 | 57 | 37.5 | 37.55 | 37.495 | 37.54 | 65830
TWTR | 2015 | 1 | 30 | 12 | 56 | 37.54 | 37.57 | 37.54 | 37.565 | 41758
TWTR | 2015 | 1 | 30 | 12 | 55 | 37.565 | 37.6 | 37.5 | 37.5 | 54317
TWTR | 2015 | 1 | 30 | 12 | 54 | 37.5 | 37.54 | 37.5 | 37.52 | 36129
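The per-minute open, high, low, close, and volume values can be derived from individual trades as in the Python sketch below. It recomputes every bar from the full list of trades on each run, in the spirit of the recomputation approach named above. The real job is a Spark batch job; the function name and the `Trade` tuple layout here are assumptions, and the example reuses the AMGN trades from the ingestion sample.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple

Trade = Tuple[str, datetime, float, int]  # (ticker, trade time, price, quantity)


def ohlcv_per_minute(trades: Iterable[Trade]) -> Dict[tuple, dict]:
    """Build one open/high/low/close/volume bar per ticker and minute."""
    bars: Dict[tuple, dict] = {}
    # Process trades in time order so open is the first and close the last price.
    for ticker, ts, price, qty in sorted(trades, key=lambda t: t[1]):
        key = (ticker, ts.replace(second=0, microsecond=0))
        bar = bars.get(key)
        if bar is None:
            bars[key] = {"open": price, "high": price, "low": price,
                         "close": price, "volume": qty}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
            bar["volume"] += qty
    return bars


trades = [
    ("AMGN", datetime(2015, 2, 3, 17, 17, 5), 149.76, 300),
    ("AMGN", datetime(2015, 2, 3, 17, 17, 53), 149.62, 207),
    ("AMGN", datetime(2015, 2, 3, 17, 16, 8), 149.81, 100),
    ("AMGN", datetime(2015, 2, 3, 17, 16, 11), 149.77, 100),
]
bars = ohlcv_per_minute(trades)
print(bars[("AMGN", datetime(2015, 2, 3, 17, 16))])
```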
SPEED LAYER
• Spark Streaming (code written in Scala)
• Task 1: Incremental algorithm to supplement the batch layer in tab 3
• Task 2: Rolling count operation for the dashboard in tab 1
[Figure: data over time, with Batch Operation runs and Speed layer segments]
id | timestamp | frequency | sentiment | ticker
----+------------+-----------+-----------+--------
0 | 1430375561 | 55 | -5 | AAPL
0 | 1430370589 | 55 | -5 | AAPL
0 | 1430365508 | 54 | -5 | AAPL
0 | 1430360540 | 54 | -5 | AAPL
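The rolling count of Task 2 can be illustrated without Spark Streaming. The Python sketch below keeps per-ticker mention counts and sentiment sums over a sliding time window; the class name, the window length, and the assumption that events arrive in non-decreasing timestamp order are all illustrative choices, not details from the slides.

```python
from collections import Counter, deque
from typing import Dict, Tuple


class RollingCount:
    """Per-ticker mention count and sentiment sum over a sliding time window."""

    def __init__(self, window_seconds: int) -> None:
        self.window = window_seconds
        self.events = deque()      # (timestamp, ticker, score), oldest first
        self.freq = Counter()
        self.sent = Counter()

    def add(self, ts: int, ticker: str, score: int) -> None:
        """Record one mention; timestamps are assumed to be non-decreasing."""
        self.events.append((ts, ticker, score))
        self.freq[ticker] += 1
        self.sent[ticker] += score
        self._expire(ts)

    def _expire(self, now: int) -> None:
        # Drop events that have fallen out of the window ending at `now`.
        while self.events and self.events[0][0] <= now - self.window:
            _, ticker, score = self.events.popleft()
            self.freq[ticker] -= 1
            self.sent[ticker] -= score
            if self.freq[ticker] == 0:
                del self.freq[ticker]
                del self.sent[ticker]

    def snapshot(self, now: int) -> Dict[str, Tuple[int, int]]:
        """Return {ticker: (frequency, sentiment)} for the current window."""
        self._expire(now)
        return {t: (self.freq[t], self.sent[t]) for t in self.freq}


rc = RollingCount(window_seconds=60)
rc.add(0, "AAPL", -1)
rc.add(30, "AAPL", -2)
rc.add(50, "TSLA", 1)
print(rc.snapshot(50))
```

Each event is added once and removed once, so the cost per mention stays constant regardless of the window length.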
SERVING LAYER
• De-normalized tables in Cassandra
• Twitter Time Series
• partitioned by ticker symbol
• clustering order by (year, month, day, hour, minute)
• Top Trending Stocks
• partitioned by (year, month, day, hour)
• clustering order by number of mentions
ticker | year | month | day | hour | minute | frequency | sentiment
--------+------+-------+-----+------+--------+-----------+-----------
TSLA | 2015 | 2 | 3 | 16 | 35 | 16 | 0
TSLA | 2015 | 2 | 3 | 16 | 34 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 33 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 32 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 31 | 9 | 3
TSLA | 2015 | 2 | 3 | 16 | 30 | 4 | 0
year | month | day | hour | frequency | sentiment | ticker
------+-------+-----+------+-----------+-----------+--------
2015 | 1 | 31 | 13 | 33 | -3 | AAPL
2015 | 1 | 31 | 13 | 17 | 3 | TWTR
2015 | 1 | 31 | 13 | 16 | 2 | SBUX
2015 | 1 | 31 | 13 | 15 | 1 | KO
2015 | 1 | 31 | 13 | 14 | 3 | MCD
2015 | 1 | 31 | 13 | 13 | 0 | EBAY
2015 | 1 | 31 | 13 | 12 | -1 | MSFT
2015 | 1 | 31 | 13 | 11 | 0 | XOM
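To see why these partition and clustering choices suit the front-end queries, the toy Python model below mimics the layout: rows live in a partition chosen by the partition key and are read back in descending clustering-key order. It is an in-memory illustration only, not Cassandra code; the class name and the choice of `(frequency, ticker)` as the clustering key for the trending table are assumptions.

```python
from typing import Any, Dict, Hashable, List, Tuple


class PartitionedTable:
    """Toy model: one dict per partition, read in descending clustering order."""

    def __init__(self) -> None:
        self._partitions: Dict[Hashable, Dict[tuple, Any]] = {}

    def upsert(self, partition_key: Hashable, clustering_key: tuple, value: Any) -> None:
        self._partitions.setdefault(partition_key, {})[clustering_key] = value

    def read(self, partition_key: Hashable, limit: int) -> List[Tuple[tuple, Any]]:
        """Return up to `limit` rows from a single partition, largest key first."""
        rows = self._partitions.get(partition_key, {})
        return [(ck, rows[ck]) for ck in sorted(rows, reverse=True)[:limit]]


# Time series: partition = ticker, clustering = (year, month, day, hour, minute).
time_series = PartitionedTable()
time_series.upsert("TSLA", (2015, 2, 3, 16, 34), {"frequency": 1, "sentiment": 0})
time_series.upsert("TSLA", (2015, 2, 3, 16, 35), {"frequency": 16, "sentiment": 0})

# Top trending: partition = (year, month, day, hour), clustering = (frequency, ticker).
trending = PartitionedTable()
trending.upsert((2015, 1, 31, 13), (17, "TWTR"), {"sentiment": 3})
trending.upsert((2015, 1, 31, 13), (33, "AAPL"), {"sentiment": -3})
trending.upsert((2015, 1, 31, 13), (16, "SBUX"), {"sentiment": 2})

print(time_series.read("TSLA", limit=1))
print(trending.read((2015, 1, 31, 13), limit=2))
```

Both queries touch exactly one partition and need no sorting at read time beyond the stored order, which is the point of de-normalizing one table per query.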
SPEED VIEW
• Cassandra TTL support can be used for the rolling count operation of the dashboard application
• Not available in the Cassandra-Spark connector
• Add a timestamp and a ranking to each ticker generated in each 5-second window
• Partitioned by ranking, clustering order by timestamp
id | timestamp | frequency | sentiment | ticker
----+------------+-----------+-----------+--------
0 | 1430375561 | 55 | -5 | AAPL
0 | 1430370589 | 55 | -5 | AAPL
0 | 1430365508 | 54 | -5 | AAPL
0 | 1430360540 | 54 | -5 | AAPL
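The ranking step can be sketched in Python as follows. It assumes the `id` column in the table above is the rank within a window (0 being the most-mentioned ticker) and that each window's counts arrive as a `{ticker: (frequency, sentiment)}` mapping; the function name and the tie-breaking rule are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def rank_window(window_counts: Dict[str, Tuple[int, int]],
                window_end_ts: int,
                top_n: int = 5) -> List[dict]:
    """Turn one window's counts into rows tagged with rank and timestamp."""
    # Highest frequency first; ties broken alphabetically by ticker.
    ordered = sorted(window_counts.items(), key=lambda kv: (-kv[1][0], kv[0]))
    return [
        {"id": rank, "timestamp": window_end_ts,
         "frequency": freq, "sentiment": sent, "ticker": ticker}
        for rank, (ticker, (freq, sent)) in enumerate(ordered[:top_n])
    ]


# Made-up counts for one window, stamped with a timestamp from the table above.
rows = rank_window({"TSLA": (12, 1), "AAPL": (55, -5)}, window_end_ts=1430375561)
print(rows[0])
```

With rows laid out this way, the dashboard can fetch the newest row of partition 0 to get the current top ticker, partition 1 for the runner-up, and so on.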
SHAFI BASHAR
• PhD, ECE, UC Davis
• Present - Intel Corporation
• Worked on 4G LTE, WiFi standardization
• Interests - Algorithms, Machine Learning
• Activities - backpacking, skiing, running, photography
[Figure: peak data rates from 40 Kbps to 1 Gbps across 2.5G, 3G, and 4G technologies: GPRS, WCDMA, HSPA, WiMAX, HSPA+, 4G LTE, 4G LTE-Advanced]
