2. About us
● You might have heard of uberVU
● Acquired by Hootsuite to develop their new analytics
● New scalability challenges => millions of customers
● Everything is in the cloud => Amazon
● 10 M social media posts per day
● 100+ Amazon EC2 instances of various sizes
● 650 GB of media posts/month
● 400 GB worth of analytics data
● 10+ MongoDB clusters
3. What this presentation will be about
● What is the big data we’re working with
● Infrastructure
● Technologies
● What we do on top of the data
● How we currently display the data
● What’s in store for the future
4. Our data
● Our data is made up of media posts
● Revolves around search queries
● You can also connect accounts and pages
● Sources: Twitter, Facebook, Google+, WordPress, Flickr,
Picasa and others
● To acquire data
○ a lot of REST
○ some streaming
○ very little scraping
5. Mentions
● Every piece of data that we process
○ gets normalized
○ gets annotated
○ gets converted to a standard JSON format
● We call the end result a mention
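A minimal sketch of the normalization step above. The field names ("source", "author", "body", "published") are illustrative assumptions, not the actual mention schema:

```python
import json

def normalize(raw_post: dict, source: str) -> dict:
    """Map a source-specific post into a standard mention dict.
    The branches and field names are hypothetical examples."""
    if source == "twitter":
        return {
            "source": "twitter",
            "author": raw_post["user"]["screen_name"],
            "body": raw_post["text"],
            "published": raw_post["created_at"],
        }
    # generic fallback for other sources
    return {
        "source": source,
        "author": raw_post.get("author", ""),
        "body": raw_post.get("content", ""),
        "published": raw_post.get("date", ""),
    }

tweet = {"user": {"screen_name": "ubervu"}, "text": "hello",
         "created_at": "2014-01-01"}
print(json.dumps(normalize(tweet, "twitter")))
```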
7. Annotations
● Language detection
○ written in C++, wrapped in Python
○ can process ~300 mentions/second
● Sentiment detection
○ external provider
○ piggybacking for efficiency
● Location detection
○ in-house algorithm
○ text tokenization and matching against a locations database
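The tokenize-and-match idea can be sketched as below; the tiny location table and the longest-match-first rule are illustrative assumptions, not the in-house algorithm:

```python
# Hypothetical locations database: phrase -> country code
LOCATIONS = {"new york": "US", "london": "GB", "paris": "FR"}

def detect_location(text: str, max_ngram: int = 2):
    """Tokenize the text and match token windows against the
    locations table, preferring longer matches first."""
    tokens = text.lower().split()
    for n in range(max_ngram, 0, -1):
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in LOCATIONS:
                return candidate, LOCATIONS[candidate]
    return None

print(detect_location("Live from New York tonight"))  # ('new york', 'US')
```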
8. The Pipeline
● Our processing infrastructure is a pipeline
● Producer-Consumer pattern
● Enables us to scale parts of the infrastructure separately
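The producer-consumer pattern in miniature, using Python's thread-safe queue (a toy stand-in for the real pipeline, where the queue is a distributed service):

```python
import queue
import threading

q = queue.Queue()

def producer():
    """Push work items onto the shared queue, then a sentinel."""
    for i in range(5):
        q.put(f"mention-{i}")
    q.put(None)  # sentinel: no more work

def consumer(results):
    """Drain the queue until the sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item.upper())

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)
```

Because producers and consumers only share the queue, each side can be scaled independently — the property the bullet above refers to.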
10. The pipeline (3)
● 100+ consumer types
● 450 consumer instances
● Automatic scaling algorithm developed by us
○ whenever a consumer falls behind, the system deploys new
consumer instances
○ automatically adjusts cluster size
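The scaling decision can be sketched roughly as follows; the sizing rule (drain the backlog within a target window) and all parameters are assumptions for illustration, not the actual algorithm:

```python
def desired_instances(backlog: int, drain_rate: int,
                      target_seconds: int = 60,
                      max_instances: int = 50) -> int:
    """Return how many consumer instances are needed to drain
    `backlog` items within `target_seconds`, given that one
    instance processes `drain_rate` items/second."""
    per_consumer = drain_rate * target_seconds
    needed = -(-backlog // per_consumer)  # ceiling division
    return max(1, min(max_instances, needed))

print(desired_instances(backlog=12000, drain_rate=10))  # 20
```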
11. MongoDB
● 10+ clusters
● Our biggest cluster
○ 1500 operations/second
○ m2.xlarge instances (17GB RAM, 6.5 ECU)
○ 8x80 GB RAID10
● Hard to manage databases in multiple clusters
○ we wrote mongo-pool https://github.com/uberVU/mongo-pool
● Cluster pyramid structure for cost efficiency
● Communication between clusters through our own mongo-oplogreplay
https://github.com/uberVU/mongo-oplogreplay
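The core idea behind a pool like mongo-pool — route database names to the right cluster by pattern — can be sketched like this (a hypothetical routing table, not the library's actual API):

```python
import re

# Hypothetical pattern -> cluster routing table
CLUSTERS = {
    r"^analytics_.*": "cluster-analytics",
    r"^mentions_.*": "cluster-mentions",
}

def cluster_for(db_name: str, default: str = "cluster-main") -> str:
    """Return the cluster that should hold the given database."""
    for pattern, cluster in CLUSTERS.items():
        if re.match(pattern, db_name):
            return cluster
    return default

print(cluster_for("analytics_2014"))  # cluster-analytics
```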
13. Kestrel
● Distributed message queue developed by Twitter
● Uses Memcache protocol
● Disk persistence
● 400 consumer operations/second
● Part of our pipeline core
● Extremely reliable
● Sending gzipped content to save I/O costs
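Because Kestrel speaks the Memcache protocol, payloads are just opaque bytes — so gzipping the JSON before enqueueing cuts I/O. A sketch of the pack/unpack step (the queue client itself is omitted):

```python
import gzip
import json

def pack(mention: dict) -> bytes:
    """Serialize a mention to JSON and gzip it before enqueueing."""
    return gzip.compress(json.dumps(mention).encode("utf-8"))

def unpack(payload: bytes) -> dict:
    """Reverse of pack(): gunzip and parse the JSON."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))

mention = {"body": "hello " * 200, "source": "twitter"}
payload = pack(mention)
print(len(json.dumps(mention)), "->", len(payload))
```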
14. Redis, Memcache
● Gradually replacing Memcache with Redis
● Used for high-access temporary information
● 60 GB worth of data
15. Other technologies
● RabbitMQ asynchronous tasks
● DynamoDB
○ analytics
○ auxiliary permanent storage use cases that don’t take a lot of
space
● S3 for data with low access rates
● Glacier for archived data
16. System metrics & monitoring
● Graphite
● 150K system metrics
● Alerts are generated based on Graphite metrics
● In-house alert-detection system
● Nagios/Nagstamon
● PagerDuty for on-call
17. Analytics overview
● Currently in DynamoDB
● MongoDB still runs in parallel; we’re considering a move back to
MongoDB
● Breakdowns on language, location, sentiment, gender
● Support for several resolutions (day, hour, 15min)
● Optimized for language and location filtering (95% of our
queries)
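One way to make language/location filtering cheap at several resolutions is to bake both dimensions into the row key, so filtered reads become direct lookups. The key layout below is an illustrative assumption, not the actual schema:

```python
from datetime import datetime

def analytics_key(query_id: str, ts: datetime, resolution: str,
                  language: str = "all", location: str = "all") -> str:
    """Compose a row key so that a language/location-filtered read
    at any resolution is a single key lookup."""
    fmt = {"day": "%Y%m%d", "hour": "%Y%m%d%H", "15min": "%Y%m%d%H%M"}[resolution]
    if resolution == "15min":
        # snap the timestamp down to the 15-minute bucket
        ts = ts.replace(minute=ts.minute - ts.minute % 15)
    return ":".join([query_id, language, location, resolution, ts.strftime(fmt)])

print(analytics_key("q42", datetime(2014, 6, 1, 13, 37), "15min", "en", "US"))
# q42:en:US:15min:201406011330
```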
18. Aggregation pipeline
● We aggregate analytics to reduce writes
● Efficient but simple concurrency through Redis primitives
● Got a 5x improvement!
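The write-reduction idea — buffer per-bucket counts and flush them as one batched write — in a self-contained sketch. In production the buffer would live in Redis (atomic INCR/HINCRBY giving the cheap concurrency); here a plain dict stands in:

```python
from collections import Counter

class Aggregator:
    """Buffer counts in memory and flush them in one batched write,
    turning N increments into a handful of writes."""
    def __init__(self, store: dict, flush_every: int = 100):
        self.buffer = Counter()
        self.store = store          # stand-in for the analytics DB
        self.flush_every = flush_every
        self.seen = 0

    def record(self, bucket: str):
        self.buffer[bucket] += 1
        self.seen += 1
        if self.seen >= self.flush_every:
            self.flush()

    def flush(self):
        for bucket, count in self.buffer.items():
            self.store[bucket] = self.store.get(bucket, 0) + count
        self.buffer.clear()
        self.seen = 0

store = {}
agg = Aggregator(store, flush_every=3)
for bucket in ["en:US", "en:US", "fr:FR"]:
    agg.record(bucket)
print(store)  # {'en:US': 2, 'fr:FR': 1}
```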
20. Tagcloud
● Tagcloud algorithm that can detect n-grams of all lengths
● Some of the data we analyze is blog content, which can be very large
● Needed something fast
● In-house algorithm
○ linear complexity (doesn’t go up with max n-gram size)
○ based on statistical correlations
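The statistical-correlation idea can be sketched with pointwise mutual information over adjacent token pairs: a single linear pass scores pairs, and pairs that co-occur more than chance are kept as phrases (repeatedly merging kept pairs would grow longer n-grams without ever enumerating windows of every size). This is an illustration of the principle, not the in-house algorithm:

```python
from collections import Counter
from math import log

def collocations(tokens, threshold=1.0, min_count=2):
    """One pass over the text: score adjacent pairs by PMI and keep
    the ones that co-occur significantly more than chance."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = {}
    for (a, b), c in bi.items():
        if c < min_count:
            continue
        pmi = log(c * n / (uni[a] * uni[b]))
        if pmi > threshold:
            scored[f"{a} {b}"] = pmi
    return scored

tokens = "big data big data pipeline scales with big data".split()
print(sorted(collocations(tokens)))  # ['big data']
```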
21. Signals
● Need to synthesize all the data we’re collecting
● Top Stories
○ O(1) algorithm
● Influencers
○ dependency graph
○ edges are interactions between users
● Spikes & Bursts
○ code written in C++ to reduce processing time
○ statistical algorithms on top of our analytics timeseries
○ adapt reads from analytics based on data size
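The influencer graph above — users as nodes, interactions as edges — reduces to ranking users by their weighted incoming interactions. A toy sketch (the edge weights and ranking rule are illustrative assumptions):

```python
from collections import defaultdict

def top_influencers(interactions, k=3):
    """Rank users by the total weight of interactions directed at
    them. `interactions` is a list of (source, target, weight)."""
    score = defaultdict(int)
    for source, target, weight in interactions:
        score[target] += weight
    return sorted(score, key=score.get, reverse=True)[:k]

interactions = [("a", "b", 3), ("c", "b", 1), ("a", "c", 2)]
print(top_influencers(interactions, k=2))  # ['b', 'c']
```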
22. Boards
● Needed a good way to display all this data
● Designed Boards
○ released this year
○ allows you to create a dashboard with metric visualizations
○ drag-drop widgets to arrange them your way
24. Future plans
● Hootsuite has millions of users
● Our analytics infrastructure will have to scale
○ Transitioning to streaming services
○ Larger MongoDB clusters
■ more shards for write throughput
■ more secondaries for reads
● Add more metrics