Guess the Country - Playing with Twitter Streaming API


Using the Twitter statuses sample API to build a name<->country database

  1. Guess the Country - Playing with Twitter Streaming API. Chris Birchall, #m3dev Tech Talk, 2014/7/11
  2. It started with an idle tweet...
  3. Let’s use Twitter for something (slightly) useful! The plan:
     ● Collect geo-tagged tweets from the Twitter Streaming API
     ● Use them to build a name⇔country DB
     ● Build a simple search UI as a proof of concept
     ● (Crowbar Spark in there somewhere because it’s cool)
  4. Implementation: Twitter Streaming API → EC2 (Twitter4j) → .log → Fluentd → S3 → EC2 (Spark) → Postgres (RDS) → Heroku (Rails)
  5. Collecting tweets and processing
     ● Ran the collector for 13 days
     ● Collected 285,340 geo-tagged tweets
     ● 205,798 distinct users
     ● Only collected names and countries, threw everything else away
     ● Used Spark to filter out duplicate users
  6. Stats
     ● Distinct countries = 204
     ● Distinct first names = 40,689
     ● Distinct last names = 81,674

     Top 10 countries by user count:

               country           | percentage
     -----------------------------+------------
      United States               |       39.4
      United Kingdom              |       10.1
      Indonesia                   |        8.9
      Brasil                      |        8.1
      Türkiye                     |        3.9
      España                      |        2.4
      México                      |        2.2
      Republic of the Philippines |        2.0
      Canada                      |        1.8
      Malaysia                    |        1.8

     Most popular first names:

      first_name
     ------------
      chris
      alex
      david
      michael
      sarah

     Most popular surnames:

      second_name
     -------------
      smith
      jones
      garcia
      williams
      johnson
  7. Results: It works surprisingly well! (Well, it worked for my name, anyway.) Note for the pedantic: since the original data is geo-tagged tweets, strictly speaking we only know where a user is, not where they come from.
  8. Try for yourself: Demo, Code
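
The processing described in slides 5 to 7 (drop duplicate users, build a name⇔country mapping, look a name up) can be sketched in a few lines of Python. This is a minimal illustration, not the talk's actual code, which used Spark, Postgres and Rails. The function names, the `(user_id, display_name, country)` record shape, the first-token/last-token name split and the additive scoring are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def split_name(display_name):
    """Assumed heuristic: first token = first name, last token = surname."""
    parts = display_name.lower().split()
    if len(parts) < 2:
        return None  # single-word names carry no first/last split
    return parts[0], parts[-1]


def dedupe_users(tweets):
    """Keep only the first (name, country) seen for each user id."""
    seen = set()
    users = []
    for user_id, display_name, country in tweets:
        if user_id in seen:
            continue
        seen.add(user_id)
        users.append((display_name, country))
    return users


def build_index(users):
    """Count how often each first name and surname appears per country."""
    first_index = defaultdict(Counter)
    last_index = defaultdict(Counter)
    for display_name, country in users:
        parsed = split_name(display_name)
        if parsed is None:
            continue
        first, last = parsed
        first_index[first][country] += 1
        last_index[last][country] += 1
    return first_index, last_index


def guess_country(first_index, last_index, display_name, top=3):
    """Rank countries by summed first-name and surname counts (assumed scoring)."""
    parsed = split_name(display_name)
    if parsed is None:
        return []
    first, last = parsed
    scores = Counter()
    scores.update(first_index.get(first, {}))
    scores.update(last_index.get(last, {}))
    return scores.most_common(top)


# Hypothetical sample records, standing in for geo-tagged tweets.
sample_tweets = [
    (1, "Chris Birchall", "United Kingdom"),
    (1, "Chris Birchall", "Japan"),          # same user again: dropped
    (2, "Chris Smith", "United States"),
    (3, "Anna Birchall", "United Kingdom"),
    (4, "Madonna", "United States"),         # no surname: not indexed
]

sample_users = dedupe_users(sample_tweets)
first_idx, last_idx = build_index(sample_users)
print(guess_country(first_idx, last_idx, "Chris Birchall"))
# [('United Kingdom', 3), ('United States', 1)]
```

Keeping only the first sighting of each user mirrors the "filter out duplicate users" step, and it also shows the caveat from slide 7: the country recorded is wherever the user happened to tweet from.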