1. wrangle_report
December 12, 2020
1 Wrangle Report
1.1 WeRateDogs Twitter Account Data Wrangling
1.2 Introduction
1.2.1 The purpose of this project is to wrangle data about the Twitter account WeRateDogs
from three different sources in order to create interesting and trustworthy analyses and
visualizations.
2 Project Details
2.0.1 1. Gathering
2.0.2 2. Assess
2.0.3 3. Clean
2.0.4 4. Store
2.1 Gathering
2.1.1 1. The Twitter archive file was downloaded manually. It contains basic tweet data
for all 5000+ of the account's tweets, but not everything.
2.1.2 2. The image predictions file was downloaded programmatically. Every image in the
WeRateDogs Twitter archive was run through a neural network that can classify breeds of
dogs. The result is a table of image predictions (the top three only) alongside
each tweet ID, image URL, and the image number that corresponded to
the most confident prediction (numbered 1 to 4, since tweets can have up to
four images).
2.1.3 3. Additional data via the Twitter API: I created a Twitter Developer
account and collected more data using the tweet ID column from the
Twitter archive file.
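As a rough illustration of the third source, the sketch below pulls the two counts out of raw tweet JSON. It assumes the API responses were saved as one JSON object per line and that each object carries the `id`, `retweet_count` and `favorite_count` fields; the helper name `parse_tweet_lines` and the sample data are hypothetical and not taken from the project notebook.

```python
import json


def parse_tweet_lines(lines):
    """Convert an iterable of JSON strings (one tweet per line) into a list of dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        # Keep only the fields used later in the analysis
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records


# Hypothetical sample standing in for the saved API output
sample = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    '',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}',
]
records = parse_tweet_lines(sample)
```

The resulting list of dicts can be passed directly to `pandas.DataFrame` to build the API data frame.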
3 Assess
3.0.1 Twitter archive
1. The archive has 2356 rows, of which only 2278 are tweets.
2. Some ratings are too high, and the rating type should be float.
3. The rating numerator has very high and very low values; it should be 10, or a multiple of 10
when several dogs are rated together.
4. The name column contains values that are not names.
5. The timestamp column is of type object.
6. doggo, floofer, pupper and puppo are values, not column names; they should be melted into
one column.
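The checks behind these observations can be sketched programmatically. The example below is only an illustration on a small made-up frame: the column names `tweet_id`, `retweeted_status_id` and `rating_numerator`, the cut-off of 20 for a "very high" rating, and the rule that lowercase names are suspect are all assumptions, not details confirmed by this report.

```python
import pandas as pd

# Hypothetical miniature stand-in for the archive
archive_df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 99.0, None, None],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "rating_numerator": [13, 12, 1776, 5],
    "name": ["Phineas", "a", "None", "Tilly"],
})

# Rows with a retweeted status id are retweets, not original tweets
n_retweets = archive_df["retweeted_status_id"].notna().sum()

# The timestamp is still stored as text rather than as a datetime
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(archive_df["timestamp"])

# Flag suspiciously high numerators (the threshold of 20 is an arbitrary assumption)
high_ratings = archive_df.loc[archive_df["rating_numerator"] > 20, "tweet_id"].tolist()

# Lowercase "names" such as "a" are probably ordinary words, not dog names
suspect_names = archive_df.loc[archive_df["name"].str.islower(), "name"].tolist()
```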
3.0.2 Image Predictions
1. Not all tweets have a valid picture of a dog; for some rows the columns p1_dog, p2_dog and p3_dog are all False.
2. jpg_url has duplicates.
3. Tidiness issue: (p1, p2, p3), (p1_conf, p2_conf, p3_conf) and (p1_dog, p2_dog, p3_dog)
are each spread across three columns instead of one.
3.0.3 Twitter API
1. Too much information about the tweets was retrieved from the Twitter API; I chose to keep
only the retweet count and favorite count.
4 Clean
4.0.1 Twitter archive
1. First, make a copy of archive_df.
2. Remove all retweets and tweets without a photo.
3. Convert timestamp to datetime format.
4. Extract ratings from the tweet text and investigate them.
5. Clean the name column.
6. Create a column named dog_class and combine the four dog-stage columns into it.
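The steps above could look roughly like the following sketch. It runs on a made-up frame and rests on several assumptions: the columns `retweeted_status_id` and `text` exist, missing dog stages are stored as the string "None", and ratings appear in the text as `numerator/denominator`. The helper `combine_stages` is hypothetical. Removing tweets without a photo and cleaning the name column are not shown here.

```python
import pandas as pd

# Hypothetical miniature stand-in for the archive
archive_df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [None, 99.0, None],
    "timestamp": ["2017-08-01 16:23:56 +0000",
                  "2017-07-30 15:58:51 +0000",
                  "2017-07-29 00:08:17 +0000"],
    "text": ["This is Bella. 13.5/10 would pet",
             "RT @someone: 12/10",
             "Here is a pupper. 11/10"],
    "doggo": ["None", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

# Steps 1 and 2: work on a copy and keep only rows that are not retweets
clean = archive_df.copy()
clean = clean[clean["retweeted_status_id"].isna()].copy()

# Step 3: convert the timestamp text to a datetime type
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# Step 4: re-extract ratings from the text, allowing decimal numerators
ratings = clean["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)")
clean["rating_numerator"] = ratings[0].astype(float)
clean["rating_denominator"] = ratings[1].astype(float)

# Step 6: collapse the four stage columns into a single dog_class column
stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join every stage present in a row; return None when no stage is given."""
    found = [value for value in row if value != "None"]
    return ", ".join(found) if found else None


clean["dog_class"] = clean[stage_cols].apply(combine_stages, axis=1)
clean = clean.drop(columns=stage_cols + ["retweeted_status_id"])
```

Extracting the numerator as a float keeps decimal ratings such as 13.5/10 intact, which an integer column would truncate.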
4.0.2 Image Predictions
1. Remove all tweets for which all three predictions failed to identify a dog breed.
2. Remove all duplicated photos.
3. I chose to keep only the first prediction (p1) for the rest of the analysis.
4.0.3 Twitter API: no cleaning needed
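The image prediction cleaning steps can be sketched as follows. This is an illustration on invented rows, not the notebook's actual code; it assumes a `tweet_id` column alongside the `jpg_url`, `p1`, `p1_conf` and `pN_dog` columns named in the assessment.

```python
import pandas as pd

# Hypothetical miniature stand-in for the image predictions table
images_df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "jpg_url": ["a.jpg", "b.jpg", "a.jpg", "c.jpg"],
    "p1": ["golden_retriever", "seat_belt", "golden_retriever", "pug"],
    "p1_conf": [0.9, 0.8, 0.9, 0.4],
    "p1_dog": [True, False, True, True],
    "p2_dog": [True, False, True, False],
    "p3_dog": [True, False, True, False],
})

images_clean = images_df.copy()

# Step 1: drop rows where none of the three predictions is a dog breed
any_dog = images_clean[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)
images_clean = images_clean[any_dog]

# Step 2: drop repeated photos, keeping the first occurrence of each URL
images_clean = images_clean.drop_duplicates(subset="jpg_url", keep="first")

# Step 3: keep only the columns belonging to the first prediction
images_clean = images_clean[["tweet_id", "jpg_url", "p1", "p1_conf", "p1_dog"]]
```

Note that a row can survive step 1 with `p1_dog` set to False if the second or third prediction is a dog, so `p1` is not guaranteed to be a breed in every remaining row.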
5 Store
5.0.1 I merged the three data frames into one master data frame and stored it as
‘twitter_archive_master.csv’. It contains only tweets with one or more photos, along with
the retweet count, the favorite count, the most confident prediction of the dog
breed as a name (if it exists), and the dog stage (if it exists).
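A minimal sketch of the merge and store step is given below, assuming all three frames share a `tweet_id` key. The frame names and sample values are hypothetical. The CSV is written to an in-memory buffer so the example has no side effects; writing to disk would use the file name instead.

```python
import io

import pandas as pd

# Hypothetical cleaned inputs
archive_clean = pd.DataFrame({"tweet_id": [1, 2, 3],
                              "rating_numerator": [13.0, 12.0, 11.0]})
images_clean = pd.DataFrame({"tweet_id": [1, 3],
                             "p1": ["golden_retriever", "pug"]})
api_df = pd.DataFrame({"tweet_id": [1, 2, 3],
                       "retweet_count": [10, 3, 8],
                       "favorite_count": [25, 7, 30]})

# The inner merge with the image table keeps only tweets that have a photo;
# the left merge then attaches the counts without dropping any further rows.
master = (
    archive_clean
    .merge(images_clean, on="tweet_id", how="inner")
    .merge(api_df, on="tweet_id", how="left")
)

# To write the file on disk: master.to_csv("twitter_archive_master.csv", index=False)
buffer = io.StringIO()
master.to_csv(buffer, index=False)
```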