More Than Websites: PHP And The Firehose @DataSift (2013)

PHP is the world's #1 programming language for creating websites. But it's capable of so much more. How about processing the social firehose in real time? :)

Published in: Technology

  1. More Than Websites: PHP And The Firehose @DataSift (Saturday, 23 March 13)
  2. Introduce Yourselves
  3. @stuherbert
  4. What is DataSift?
  5. Sift through social data: Twitter firehose, Facebook, bitly clicks, news, videos, comments and more
  6. Gain insights using augmentations: language, gender, trends, links, sentiment, salience & entity analysis and more
  7. Realtime: get matching data within seconds of it being posted
  8. Historics: search our social data archive going back to January 2010
  9. Pull the data from our servers via HTTP/1.1 streaming or websockets
  10. Let us push data to you: have the data delivered directly to your servers or into your databases
  11. DataSift in numbers
  12. 30: sources of social data and data augmentations
  13. Up to 20,000: number of new pieces of data ingested into DataSift every second
  14. 3 terabytes: amount of new data added to the Historics archive every week
  15. 12: different ways we can deliver data to you
  16. 1: average number of seconds to pass the data through DataSift
  17. 12: number of services data passes through inside DataSift
  18. 25: number of engineers who write code for the DataSift platform
  19. 5: primary programming languages (C++, Node, PHP, Python, Scala)
  20. 154: private GitHub repos
  21. Our GitHub Repositories [bar chart of repository counts by language, axis 0 to 60: PHP, Java & Scala, C & C++, JS & Node, Unclassified, Python, Shell Script, Ruby, C#, VimL]
  22. Architecture
  23. Three major data pipelines + supporting services
  24. Data Archiving: adds new data to the Historics Archive
  25. Filtering Pipeline: filtering and delivery of data in realtime
  26. Playback Pipeline: filtering and delivery of data from the Historics Archive
  27. DataSift Technical Architecture [diagram: "DataSift Architecture 2.2", credited to @lorenzoalberton. It shows input streams (Twitter, Bit.ly, Facebook and others) feeding data ingestion and the augmentation pipeline; the Historics archive on an HBase cluster with HDFS and Hadoop; Kafka and Redis between services; the CSDL compiler and filtering engine; and delivery through HTTP streaming, WebSockets and push destinations such as HTTP(S) POST, (S)FTP, Amazon S3, DynamoDB, MongoDB and CouchDB]
  28. Filtering Pipeline [the same "DataSift Architecture 2.2" diagram]
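
The figures on the "in numbers" slides (13, 14, 16 and 17) can be combined into a few rough capacity estimates. The sketch below is not from the talk. It assumes that "up to 20,000" is a peak rate, that a terabyte is 10^12 bytes, that the 12 services sit one after another on the data path, and that the one-second transit time is spread evenly across them. All function and constant names are made up for illustration.

```python
# Back-of-envelope estimates from the slide figures.
# Every name here is hypothetical; only the four numbers come from the slides.

PEAK_ITEMS_PER_SECOND = 20_000          # slide 13: "up to 20,000" per second
ARCHIVE_BYTES_PER_WEEK = 3 * 10**12     # slide 14: 3 terabytes (assuming decimal TB)
AVERAGE_TRANSIT_SECONDS = 1             # slide 16: average time through DataSift
SERVICES_ON_PATH = 12                   # slide 17: services the data passes through

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY


def peak_items_per_day():
    """Upper bound on daily volume if the peak rate were sustained all day."""
    return PEAK_ITEMS_PER_SECOND * SECONDS_PER_DAY


def average_archive_bytes_per_second():
    """Average write rate needed to add 3 TB to the archive each week."""
    return ARCHIVE_BYTES_PER_WEEK / SECONDS_PER_WEEK


def items_in_flight_at_peak():
    """Little's law: items in the system = arrival rate * time in system."""
    return PEAK_ITEMS_PER_SECOND * AVERAGE_TRANSIT_SECONDS


def average_ms_per_service():
    """Time budget per service, assuming an even split across a serial path."""
    return AVERAGE_TRANSIT_SECONDS * 1000 / SERVICES_ON_PATH


if __name__ == "__main__":
    print("peak items per day:", peak_items_per_day())
    print("archive bytes per second:", round(average_archive_bytes_per_second()))
    print("items in flight at peak:", items_in_flight_at_peak())
    print("ms per service:", round(average_ms_per_service(), 1))
```

Under those assumptions the numbers come to roughly 1.7 billion items per day at sustained peak, about 5 MB per second written to the archive on average, around 20,000 items in flight at any moment, and an average budget of about 83 ms per service.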
