#CASHTAG
BIG DATA PIPELINE FOR USER SENTIMENT ANALYSIS
Shafi Bashar
MOTIVATION
• People have opinions
• Different sources, different mediums - Twitter, Reddit, Facebook, etc.
• Platform for aggregating and analyzing opinions on a topic
• v 1.0: Users' opinions of the US stock market
DEMO
• Webpage
http://www.hashtagcashtag.com
• Video
http://youtu.be/7oMrJ7n1Hr4
• Alternate Link
http://54.67.108.50
PIPELINE λ ARCHITECTURE
[Diagram components: Data Ingestion, HDFS, Batch Layer, Speed Layer, Serving Layer, Batch View, Real-time View, Front End]
DATA INGESTION
• Two sources
1. Twitter Data
2. Stock Data
• Twitter Data from streaming API
{u'contributors': None,
u'coordinates': None,
u'created_at': u'Mon Feb 02 07:41:06 +0000 2015',
u'entities': {u'hashtags': [],
u'symbols': [{u'indices': [0, 3], u'text': u'FB'}],
u'urls': [{u'display_url': u'stocknewswires.com/2015/02/fb-goo\u2026',
u'expanded_url': u'http://www.stocknewswires.com/2015/02/fb-google-glass-might-be-googles-most-successful-failure-yet.html',
u'indices': [67, 89],
u'url': u'http://t.co/6iY3WYz82M'}],
u'user_mentions': []},
u'favorite_count': 0,
u'favorited': False,
u'geo': None,
u'id': 562153724219764737,
u'id_str': u'562153724219764737',
u'in_reply_to_screen_name': None,
u'in_reply_to_status_id': None,
u'in_reply_to_status_id_str': None,
u'in_reply_to_user_id': None,
u'in_reply_to_user_id_str': None,
u'lang': u'en',
u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'},
u'place': None,
u'possibly_sensitive': False,
u'retweet_count': 0,
u'retweeted': False,
u'source': u'<a href="http://fnsappbg.blogspot.com/" rel="nofollow">FNS_APP</a>',
u'text': u"$FB:\n\nGoogle Glass Might Be Google's Most Successful Failure Yet:\n\nhttp://t.co/6iY3WYz82M",
u'truncated': False,
u'user': {u'contributors_enabled': False,
u'created_at': u'Mon Nov 17 20:15:38 +0000 2014',
u'default_profile': True,
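For the stock use case, the fields of interest in a tweet like the one above are the cashtags under entities → symbols and the created_at timestamp. The following is a minimal sketch of extracting both; the function names are hypothetical, and the original pipeline may pull these fields out differently.

```python
from datetime import datetime


def extract_cashtags(tweet):
    """Return the ticker symbols listed under entities -> symbols."""
    symbols = tweet.get("entities", {}).get("symbols", [])
    return [s["text"].upper() for s in symbols]


def parse_created_at(tweet):
    """Parse Twitter's created_at string into a timezone-aware datetime."""
    return datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")


# A trimmed-down tweet carrying only the fields used here.
tweet = {
    "created_at": "Mon Feb 02 07:41:06 +0000 2015",
    "entities": {"symbols": [{"indices": [0, 3], "text": "FB"}]},
}
print(extract_cashtags(tweet))              # ['FB']
print(parse_created_at(tweet).isoformat())  # 2015-02-02T07:41:06+00:00
```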
DATA INGESTION
• Stock Data from www.netfonds.no
• Incremental CSV file for each individual stock
• Preprocessing to add ticker and timestamp
• Multi-topic, multi-consumer Kafka
20150126T153000 113.67 100 Auto trade
20150126T153000 113.65 161 Auto trade
20150126T153000 113.68 270 Auto trade
20150126T153000 113.67 100 Auto trade
20150126T153001 113.66 100 Auto trade
20150126T153001 113.65 100 Auto trade
20150126T153001 113.67 100 Auto trade
1422981307.82,AMGN,2015-02-03 17:17:05,149.76,300,,Auto trade,,,
1422981307.82,AMGN,2015-02-03 17:17:53,149.62,207,,Auto trade,,,
1422981307.83,AMGN,2015-02-03 17:16:08,149.81,100,,Auto trade,,,
1422981307.83,AMGN,2015-02-03 17:16:11,149.77,100,,Auto trade,,,
1422981308.26,LNKD,2015-02-03 17:15:28,228.2,200,,Auto trade,,,
1422981308.27,LNKD,2015-02-03 17:16:45,228.5,100,,Auto trade,,,
1422981308.27,LNKD,2015-02-03 17:19:02,229.27,100,,Auto trade,,,
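The preprocessing step can be sketched roughly as below. This is an illustration only: it assumes the raw rows are tab-separated with eight fields (the empty columns in the processed sample suggest this), and the function name and the ingest_ts parameter are hypothetical. The Kafka producer call is left out.

```python
import time
from datetime import datetime


def preprocess_trade(raw_line, ticker, ingest_ts=None):
    """Prefix one raw trade row with an ingest timestamp and its ticker."""
    fields = raw_line.rstrip("\n").split("\t")
    # The first field is a compact timestamp such as 20150126T153000.
    traded_at = datetime.strptime(fields[0], "%Y%m%dT%H%M%S")
    if ingest_ts is None:
        ingest_ts = time.time()
    out = ["%.2f" % ingest_ts, ticker, traded_at.strftime("%Y-%m-%d %H:%M:%S")]
    out.extend(fields[1:])  # price, quantity and the remaining columns as-is
    return ",".join(out)


# Hypothetical raw row; the tab layout is an assumption.
raw = "20150126T153000\t113.67\t100\t\tAuto trade\t\t\t"
print(preprocess_trade(raw, "AAPL", ingest_ts=1422981307.82))
# 1422981307.82,AAPL,2015-01-26 15:30:00,113.67,100,,Auto trade,,,
```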
BATCH LAYER
• Spark batch job (written in Scala)
• Twitter
• Number of mentions and sentiment of the mentions per time granularity
• Top trending stocks
ticker | year | month | day | hour | minute | frequency | sentiment
--------+------+-------+-----+------+--------+-----------+-----------
TSLA | 2015 | 2 | 3 | 16 | 35 | 16 | 0
TSLA | 2015 | 2 | 3 | 16 | 34 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 33 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 32 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 31 | 9 | 3
TSLA | 2015 | 2 | 3 | 16 | 30 | 4 | 0
year | month | day | hour | frequency | sentiment | ticker
------+-------+-----+------+-----------+-----------+--------
2015 | 1 | 31 | 13 | 33 | -3 | AAPL
2015 | 1 | 31 | 13 | 17 | 3 | TWTR
2015 | 1 | 31 | 13 | 16 | 2 | SBUX
2015 | 1 | 31 | 13 | 15 | 1 | KO
2015 | 1 | 31 | 13 | 14 | 3 | MCD
2015 | 1 | 31 | 13 | 13 | 0 | EBAY
2015 | 1 | 31 | 13 | 12 | -1 | MSFT
2015 | 1 | 31 | 13 | 11 | 0 | XOM
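The two Twitter tables above could be derived along the following lines. This is a plain-Python sketch of the aggregation logic, not the Spark job from the slides; the function names and the input shape (ticker, datetime, sentiment) are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def per_minute_counts(mentions):
    """Sum frequency and sentiment per (ticker, year, month, day, hour, minute)."""
    totals = defaultdict(lambda: [0, 0])
    for ticker, t, sentiment in mentions:
        key = (ticker, t.year, t.month, t.day, t.hour, t.minute)
        totals[key][0] += 1
        totals[key][1] += sentiment
    return {key: tuple(value) for key, value in totals.items()}


def top_trending(minute_rows, limit=8):
    """Roll minute rows up to hours and rank tickers by mentions within each hour."""
    hourly = defaultdict(lambda: [0, 0])
    for (ticker, y, mo, d, h, _minute), (freq, sent) in minute_rows.items():
        hourly[(y, mo, d, h, ticker)][0] += freq
        hourly[(y, mo, d, h, ticker)][1] += sent
    by_hour = defaultdict(list)
    for (y, mo, d, h, ticker), (freq, sent) in hourly.items():
        by_hour[(y, mo, d, h)].append((freq, sent, ticker))
    return {
        hour: sorted(rows, key=lambda r: r[0], reverse=True)[:limit]
        for hour, rows in by_hour.items()
    }


mentions = [
    ("TSLA", datetime(2015, 2, 3, 16, 31, 5), 1),
    ("TSLA", datetime(2015, 2, 3, 16, 31, 40), 2),
    ("TSLA", datetime(2015, 2, 3, 16, 30, 0), 0),
    ("AAPL", datetime(2015, 2, 3, 16, 31, 0), -1),
]
minute_rows = per_minute_counts(mentions)
print(minute_rows[("TSLA", 2015, 2, 3, 16, 31)])  # (2, 3)
print(top_trending(minute_rows)[(2015, 2, 3, 16)])
```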
SENTIMENT ANALYSIS
• Negative: "downgraded", "bears", "bear", "bearish", "volatile", "short", "sell", "selling", "forget", "down", "resistance", "sold", …
• Positive: "upgrade", "upgraded", "long", "buy", "buying", "growth", "good", "gained", "well", "great", "nice", "top", …
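One simple way to turn word lists like these into an integer score is to count +1 for each positive word and -1 for each negative word. The slides do not state the exact scoring rule, so the rule below is an assumption, and the sets contain only the words shown on the slide (the elided entries are not filled in).

```python
import re

# Only the words visible on the slide; the full lists are longer.
NEGATIVE = {"downgraded", "bears", "bear", "bearish", "volatile", "short",
            "sell", "selling", "forget", "down", "resistance", "sold"}
POSITIVE = {"upgrade", "upgraded", "long", "buy", "buying", "growth",
            "good", "gained", "well", "great", "nice", "top"}


def sentiment_score(text):
    """Assumed rule: +1 per positive word, -1 per negative word."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


print(sentiment_score("Analysts upgraded $TSLA, time to buy"))  # 2
print(sentiment_score("$AAPL looks bearish, selling now"))      # -2
```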
BATCH LAYER
• Stocks
• high, low, open, close, volume
• Azkaban controls the flow and scheduling
• Batch layer uses a recomputation algorithm
ticker | year | month | day | hour | minute | close | high | low | open | volume
--------+------+-------+-----+------+--------+--------+--------+--------+--------+--------
TWTR | 2015 | 1 | 30 | 13 | 0 | 37.55 | 37.55 | 37.55 | 37.55 | 6740
TWTR | 2015 | 1 | 30 | 12 | 59 | 37.54 | 37.56 | 37.46 | 37.47 | 96070
TWTR | 2015 | 1 | 30 | 12 | 58 | 37.47 | 37.51 | 37.46 | 37.51 | 47839
TWTR | 2015 | 1 | 30 | 12 | 57 | 37.5 | 37.55 | 37.495 | 37.54 | 65830
TWTR | 2015 | 1 | 30 | 12 | 56 | 37.54 | 37.57 | 37.54 | 37.565 | 41758
TWTR | 2015 | 1 | 30 | 12 | 55 | 37.565 | 37.6 | 37.5 | 37.5 | 54317
TWTR | 2015 | 1 | 30 | 12 | 54 | 37.5 | 37.54 | 37.5 | 37.52 | 36129
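A per-minute open/high/low/close/volume table like the one above can be recomputed from the full trade history on every batch run. The sketch below shows that recomputation in plain Python rather than Spark; the function name and the (ticker, datetime, price, quantity) input shape are assumptions.

```python
from datetime import datetime


def minute_bars(trades):
    """Recompute OHLCV bars per (ticker, year, month, day, hour, minute)."""
    bars = {}
    # Process trades in time order so open and close are well defined.
    for ticker, t, price, qty in sorted(trades, key=lambda r: r[1]):
        key = (ticker, t.year, t.month, t.day, t.hour, t.minute)
        bar = bars.get(key)
        if bar is None:
            bars[key] = {"open": price, "high": price, "low": price,
                         "close": price, "volume": qty}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
            bar["volume"] += qty
    return bars


# Made-up trades for illustration.
trades = [
    ("TWTR", datetime(2015, 1, 30, 12, 59, 30), 37.56, 200),
    ("TWTR", datetime(2015, 1, 30, 12, 59, 1), 37.47, 100),
    ("TWTR", datetime(2015, 1, 30, 12, 59, 45), 37.46, 50),
    ("TWTR", datetime(2015, 1, 30, 12, 59, 59), 37.54, 150),
]
print(minute_bars(trades)[("TWTR", 2015, 1, 30, 12, 59)])
```

Because each run starts from the raw trades, a late or corrected record is picked up on the next run without any special handling.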
SPEED LAYER
• Spark Streaming (code written in Scala)
• Task 1: Incremental algorithm to supplement the batch layer in tab 3
• Task 2: Rolling count operation for the dashboard in tab 1
[Figure: batch operations and speed-layer updates plotted against data over time]
id | timestamp | frequency | sentiment | ticker
----+------------+-----------+-----------+--------
0 | 1430375561 | 55 | -5 | AAPL
0 | 1430370589 | 55 | -5 | AAPL
0 | 1430365508 | 54 | -5 | AAPL
0 | 1430360540 | 54 | -5 | AAPL
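The incremental task can be pictured as follows: each new mention updates one cell of a real-time view in place, and queries combine that view with the last batch result. This is a plain-Python sketch of the idea, not the Spark Streaming code; the function names and the batch_horizon parameter (the last minute covered by the batch run) are hypothetical.

```python
def apply_increment(realtime_view, ticker, minute_key, sentiment):
    """Update one (frequency, sentiment) cell in place as a mention arrives."""
    key = (ticker,) + minute_key
    freq, sent = realtime_view.get(key, (0, 0))
    realtime_view[key] = (freq + 1, sent + sentiment)


def merged_view(batch_view, realtime_view, batch_horizon):
    """Serve batch rows up to batch_horizon and real-time rows after it."""
    merged = {k: v for k, v in batch_view.items() if k[1:] <= batch_horizon}
    merged.update({k: v for k, v in realtime_view.items() if k[1:] > batch_horizon})
    return merged


# Keys are (ticker, year, month, day, hour, minute).
batch_view = {("TSLA", 2015, 2, 3, 16, 30): (4, 0)}
realtime_view = {}
apply_increment(realtime_view, "TSLA", (2015, 2, 3, 16, 31), 3)
apply_increment(realtime_view, "TSLA", (2015, 2, 3, 16, 31), 0)
print(merged_view(batch_view, realtime_view, (2015, 2, 3, 16, 30)))
```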
SERVING LAYER
• De-normalized tables in Cassandra
• Twitter Time Series
• partitioned by ticker symbol
• clustering order by (year, month, day, hour, minute)
• Top Trending Stocks
• partitioned by (year, month, day, hour)
• clustering order by number of mentions
ticker | year | month | day | hour | minute | frequency | sentiment
--------+------+-------+-----+------+--------+-----------+-----------
TSLA | 2015 | 2 | 3 | 16 | 35 | 16 | 0
TSLA | 2015 | 2 | 3 | 16 | 34 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 33 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 32 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 31 | 9 | 3
TSLA | 2015 | 2 | 3 | 16 | 30 | 4 | 0
year | month | day | hour | frequency | sentiment | ticker
------+-------+-----+------+-----------+-----------+--------
2015 | 1 | 31 | 13 | 33 | -3 | AAPL
2015 | 1 | 31 | 13 | 17 | 3 | TWTR
2015 | 1 | 31 | 13 | 16 | 2 | SBUX
2015 | 1 | 31 | 13 | 15 | 1 | KO
2015 | 1 | 31 | 13 | 14 | 3 | MCD
2015 | 1 | 31 | 13 | 13 | 0 | EBAY
2015 | 1 | 31 | 13 | 12 | -1 | MSFT
2015 | 1 | 31 | 13 | 11 | 0 | XOM
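The partitioning and clustering choices above map onto CQL table definitions. The helper below is a small sketch that builds such a CREATE TABLE statement as a string; the table name, column types, and descending order are assumptions (the sample rows are listed newest first), and no Cassandra driver is involved.

```python
def create_table_cql(name, columns, partition_key, clustering):
    """Build a CREATE TABLE statement from partition and clustering columns."""
    column_defs = ", ".join("%s %s" % (col, cql_type) for col, cql_type in columns)
    partition = ", ".join(partition_key)
    cluster_cols = ", ".join(col for col, _ in clustering)
    order = ", ".join("%s %s" % (col, direction) for col, direction in clustering)
    return (
        "CREATE TABLE %s (%s, PRIMARY KEY ((%s), %s)) "
        "WITH CLUSTERING ORDER BY (%s);"
        % (name, column_defs, partition, cluster_cols, order)
    )


# Hypothetical table name and types for the Twitter time series table.
time_parts = ["year", "month", "day", "hour", "minute"]
columns = ([("ticker", "text")] + [(c, "int") for c in time_parts]
           + [("frequency", "int"), ("sentiment", "int")])
print(create_table_cql(
    "twitter_time_series",
    columns,
    partition_key=["ticker"],
    clustering=[(c, "DESC") for c in time_parts],
))
```

With the ticker as partition key, all rows for one stock sit in a single partition and can be read back already sorted by time.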
SPEEDVIEW
• Cassandra TTL support can be used for the rolling count operation for the dashboard application
• Not available in the Cassandra-Spark connector
• Add a timestamp and ranking to each ticker generated in each 5-second window
• Partitioned by ranking, clustering order by timestamp
id | timestamp | frequency | sentiment | ticker
----+------------+-----------+-----------+--------
0 | 1430375561 | 55 | -5 | AAPL
0 | 1430370589 | 55 | -5 | AAPL
0 | 1430365508 | 54 | -5 | AAPL
0 | 1430360540 | 54 | -5 | AAPL
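Without TTL support, the rolling count has to expire old events itself and emit ranked rows for each window. The sketch below shows one way to do that in plain Python; the class name, the window length, and the reading of id as a zero-based rank are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import deque


class RollingCount:
    """Keep mentions from the last window_seconds and rank tickers by count."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, ticker, sentiment), oldest first

    def add(self, timestamp, ticker, sentiment):
        self.events.append((timestamp, ticker, sentiment))

    def ranked_rows(self, now, top_n=5):
        # Drop events that have fallen out of the window (manual expiry).
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        totals = {}
        for _, ticker, sentiment in self.events:
            freq, sent = totals.get(ticker, (0, 0))
            totals[ticker] = (freq + 1, sent + sentiment)
        ordered = sorted(totals.items(), key=lambda item: item[1][0], reverse=True)
        # One row per rank, stamped with the time the window was evaluated.
        return [
            {"id": rank, "timestamp": now, "frequency": freq,
             "sentiment": sent, "ticker": ticker}
            for rank, (ticker, (freq, sent)) in enumerate(ordered[:top_n])
        ]


counter = RollingCount(window_seconds=60)
counter.add(0, "AAPL", -1)
counter.add(30, "AAPL", -1)
counter.add(50, "TSLA", 1)
print(counter.ranked_rows(now=59))
```

Writing each rank under its own partition and clustering by timestamp means the dashboard can read the latest row for rank 0, rank 1, and so on with one query per rank.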
SHAFI BASHAR
• PhD, ECE, UC Davis
• Present - Intel Corporation
• Worked on 4G LTE, WiFi standardization
• Interests - Algorithms, Machine Learning
• Activities - backpacking, skiing, running, photography
[Figure: peak data rates from 2.5G through 3G to 4G: GPRS 40 Kbps, WCDMA 384 Kbps, HSPA 14 Mbps, WiMAX 128 Mbps, HSPA+ 168 Mbps, 4G LTE 300 Mbps, 4G LTE-Advanced 1 Gbps]
