Python vs R for Data Analytics Final
1. PYTHON VS R
BY: KENNAN DUFFY, DARIA GBOR, CHRIS LUKENS,
JOHN SAVIELLO, & JAMES SCHEUREN
http://project.mis.temple.edu/pythonvsranalytics/final-deliverables/
2. AGENDA
1. Our Process
2. Use Cases
3. Sentiment Analysis - Python
4. Sentiment Analysis - R
5. Scorecard
6. Recommendation
7. Q & A
3. OUR PROCESS
RESEARCH
Conduct web research on the use
case and the language
PYTHON
Complete the use case in Python
ANALYZE
Review and analyze the results of
Python & R as a team
R CODE
Complete the use case in R
DEFINE
Define the business purpose of the
use case and completion plan
SCORE
Fill out the scorecard based on previously
defined scoring criteria
4. SCORECARD
Criteria Weight (%)
Package Requirement 10%
Lines of Code 5%
Simplicity 10%
Popularity 5%
Development Sources 10%
Data Visualization 15%
Functionality 45%
Total 100%
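A minimal sketch of how the weights above could be combined into one score per language. The slides do not spell out the formula, so the plain weighted average used here is an assumption, and the example points are hypothetical rather than the team's actual scores.

```python
# Weights copied from the scorecard slide (percent of the total score).
WEIGHTS = {
    "Package Requirement": 10,
    "Lines of Code": 5,
    "Simplicity": 10,
    "Popularity": 5,
    "Development Sources": 10,
    "Data Visualization": 15,
    "Functionality": 45,
}


def weighted_total(points):
    """Combine per-criterion points (0-10) into one score out of 10.

    Assumption: the total is a plain weighted average of the points.
    """
    missing = set(WEIGHTS) - set(points)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    weighted_sum = sum(WEIGHTS[name] * points[name] for name in WEIGHTS)
    return weighted_sum / sum(WEIGHTS.values())


# Hypothetical points for illustration only; not the team's actual scores.
example_points = {name: 7 for name in WEIGHTS}
example_points["Functionality"] = 10
print(weighted_total(example_points))
```

Because Functionality carries 45% of the weight, a change in that one criterion moves the total more than a change in any other.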
5. USE CASE #1 - PREDICTIVE ANALYTICS
What
➔ An NFL franchise wants to ensure that the player it selects
in the draft will be a high performer
How
➔ Linear Regression using the NFL combine dataset from 1985-2015
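A minimal sketch of the technique named above, simple linear regression by ordinary least squares, written in plain Python. The variable names (a 40-yard dash time and a performance score) and all of the numbers are made up for illustration; they are not taken from the combine dataset, and the team's actual model and features are not described in the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predicted y for a new x under the fitted line."""
    return intercept + slope * x


# Made-up numbers purely to exercise the function; NOT combine data.
forty_times = [4.4, 4.5, 4.6, 4.7, 4.8]   # hypothetical predictor
performance = [9.0, 8.0, 7.0, 6.0, 5.0]   # hypothetical outcome

slope, intercept = fit_line(forty_times, performance)
print(slope, intercept, predict(slope, intercept, 4.55))
```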
6. USE CASE #2 - TEXT MINING
What
➔ Justin Trudeau’s campaign team wants to stay up to date on
public opinion of him
How
➔ Sentiment analysis using a Twitter feed as our dataset
7. USE CASE #3 - IMAGE ANALYTICS
What
➔ England wants to keep track of what is happening on its
busy streets for security purposes
How
➔ Object detection using a picture of a busy street in England
8. SENTIMENT ANALYSIS - PYTHON
csv
Allows us to write output to
csv file for analysis
Tweepy
Python library that provides
access to the Twitter API
and its functions
TextBlob
Natural language processor
to get subjectivity and
polarity of tweets
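A rough sketch of how the csv and TextBlob pieces could fit together. It assumes the tweet texts have already been fetched (for example with Tweepy, which needs API credentials and is left out here). The function names, the CSV column names, and the rule that labels a tweet by the sign of its polarity are all assumptions for illustration, not the team's actual code.

```python
import csv


def textblob_scorer(text):
    """Return (polarity, subjectivity) for one tweet using TextBlob."""
    from textblob import TextBlob  # third-party package, imported on first use

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def label_for(polarity):
    """Assumed rule: the sign of the polarity decides the label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def write_sentiment_csv(tweets, path, scorer=textblob_scorer):
    """Score each tweet text and write one CSV row per tweet."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["tweet", "polarity", "subjectivity", "label"])
        for text in tweets:
            polarity, subjectivity = scorer(text)
            writer.writerow([text, polarity, subjectivity, label_for(polarity)])
```

Calling `write_sentiment_csv(tweet_texts, "tweets.csv")` with a list of strings would produce a file ready for manual review. The `scorer` parameter is there so a different sentiment function can be swapped in without touching the CSV logic.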
18. GRADING CRITERIA
1. Package Requirement:
0 packages = 10 points
1 package = 9 points
2 packages = 8 points
3 packages = 7 points
4 packages = 6 points
5 packages = 5 points
6 packages = 4 points
7 packages = 3 points
8 packages = 2 points
9 packages = 1 point
10 packages = 0 points
3. Simplicity:
Quick, really simple to write, really simple to read = 10
Took a while to complete, but pretty simple, easy to understand = 7
Took so long to complete, not very simple, hard to understand = 4
Hard to write, almost impossible, not able to read = 1
4. Popularity:
Very popular in the industry = 10
A lot of people use this language = 7
Some people use this language = 4
No one uses it = 1
5. Development Sources:
A lot of help in the online community = 10
Some resources available, decently helpful sources = 7
Not many resources available = 4
No help available online = 1
6. Data Visualization:
Easy to manipulate, cleanliness, visually appealing = 10
Harder to manipulate, messy, not exciting = 7
Harder to manipulate, difficult to read = 4
Unable to manipulate, unreadable = 1
7. Functionality:
Accurate data, does everything it needs to do = 10
Mostly accurate data, does most of what it needs to do = 7
Inaccurate data, barely does what it needs to do = 4
Is not able to complete the task = 1
2. Lines of Code:
0-10 lines = 10 points
11-20 = 9 points
21-30 = 8 points
31-40 = 7 points
41-50 = 6 points
51-60 = 5 points
61-70 = 4 points
71-80 = 3 points
81-90 = 2 points
91-100 = 1 point
101 + = 0 points
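The two count-based criteria (Package Requirement and Lines of Code) follow a fixed pattern, so they can be sketched as small functions. The function names are hypothetical, and awarding 0 points for more than 10 packages is an assumption, since the slide stops at 10.

```python
def package_points(package_count):
    """10 points for 0 packages, one point less per extra package, floor of 0."""
    if package_count < 0:
        raise ValueError("package_count cannot be negative")
    # Assumption: more than 10 packages still scores 0.
    return max(0, 10 - package_count)


def lines_of_code_points(line_count):
    """0-10 lines = 10 points, 11-20 = 9, ..., 91-100 = 1, 101+ = 0."""
    if line_count < 0:
        raise ValueError("line_count cannot be negative")
    if line_count <= 10:
        return 10
    # Each further block of 10 lines costs one point, down to 0.
    return max(0, 10 - (line_count - 1) // 10)


print(package_points(3), lines_of_code_points(45))
```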
Lines of code - we set up standard criteria for this measurement: 0-10 lines earned 10 points, 11-20 lines earned 9, and so forth
Development sources - how strong is the online support community, and how many helpful sources are available to help us complete the use case and troubleshoot issues that arise
Functionality - is it able to do what we want, and how well does it accomplish that
TextBlob struggled to identify positive and neutral tweets
Explain how we got accuracy - retrieved 100 tweets, judged each one as a team, and compared our judgment to the package results to see if we agreed with the outcome
Neutral = 10/53
Negative = 11/21
Positive = 7/26
syuzhet - Used for sentiment analysis; this is the package that reads the tweets
tm - Works with SnowballC and twitteR to mine text
twitteR - Interacts with the Twitter API to get tweets for analysis
SnowballC - Stems words (reduces them to their root form) so that they are easier for other packages to read
Explain how we got accuracy - retrieved 100 tweets, judged each one as a team, and compared our judgment to the package results to see if we agreed with the outcome
Accuracy: 50% overall. 77% negative (30/39). 27% positive (9/33). 39% neutral (11/28).
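A small sketch of the accuracy arithmetic behind these notes. It reads each fraction as (tweets where the team agreed with the package) / (tweets with that label), and assumes the first set of tallies belongs to the TextBlob run and the second to the R packages. Both readings are assumptions based on where the figures sit in the notes.

```python
def overall_accuracy(tallies):
    """Share of all reviewed tweets where the team agreed with the package."""
    agreed = sum(a for a, _ in tallies.values())
    total = sum(t for _, t in tallies.values())
    return agreed / total


def per_label_accuracy(tallies):
    """Agreement rate for each sentiment label."""
    return {label: a / t for label, (a, t) in tallies.items()}


# (agreed, total) pairs copied from the notes.
textblob_tallies = {"neutral": (10, 53), "negative": (11, 21), "positive": (7, 26)}
r_tallies = {"negative": (30, 39), "positive": (9, 33), "neutral": (11, 28)}

print(overall_accuracy(textblob_tallies), overall_accuracy(r_tallies))
print(per_label_accuracy(r_tallies))
```

Both sets of tallies add up to the 100 reviewed tweets, which is why the overall figure for the second set comes out to exactly 50%.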
MENTION: Functionality
USE FOR LESSONS LEARNED
Found out that it is more accurate with negative tweets
Not perfect, picks out certain words to decide whether it is positive or negative. Sarcasm is difficult.
Positive-tweet classifications should not be trusted
R is built for predictive analytics and is ready to run it, whereas Python needs to be molded into running the linear regression
The packages we ran for R were much more accurate than the Python packages for running sentiment analysis
R offers more functionality than Python for image analytics and is very simple to change: switching between face detection, landmark detection, logo detection, and object detection was a matter of changing only 2 lines of code