Your SlideShare is downloading. ×
0
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)

2,312

Published on

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
2,312
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
1
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. JoeOlson DataArchitect SmartChicagoCollaborative 27Mar2014 joe.olson@cct.org (All the cool buzzwords in one place!) Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics
  • 2. Social Media - Twitter • What can we learn from Twitter? • 400 million tweets per day source: http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jack-dorsey-twitter • 218 million users source: http://techcrunch.com/2013/10/03/bweeting/ • Excellent source of sentiment • Excellent source of big data • Prototyping • Modeling natural language • Resume padding
  • 3. Social Media - Twitter • How do we get at the data? • Twitter provided APIs: • https://dev.twitter.com/docs • Streaming • Set up a real time data stream (json) based on keywords • REST (v1.1) • Make REST requests, and get results • Possible parameters: • Geospatial bounding box • By time • By user, hashtag, retweets etc • Fire hose • Big $$$. Big data
  • 4. Social Media - Twitter • Information & Obstacles • Who • What • At best: Plain English (!) • Worse: (Spanish or Arabic or Portuguese...) • Worst: “Textspeak” symbols :-0, UTF8 chars, etc. • Absolute Worst: combination of all of them • Where • 1-2% with latitude / longitude • Geocode • When
  • 5. Social Media - Twitter JSON Tweet example: • "created_at":"Sun Oct 27 13:57:40 +0000 2013", • "id":394462908261740540, • "text":"Flu :(", • "source":"<a href="http://twitmania.com" rel="nofollow">TwitMania™</a>", • "user":{ • "id":594141140, • "name":"Yultiana Farida N", • "screen_name":"yultiana", • "followers_count":231, • "friends_count":252, • "created_at":"Tue May 29 23:58:25 +0000 2012", • "statuses_count":2397, • }, • "geo":null,
  • 6. Cloud Computing • What does cloud computing bring to the table? • Amazon’s EC2: • Commoditized hardware • Low cost • Only charged for resources you use • No long term commitments • Scalable • "Throwaway" mentality **IF** you play by their rules!
  • 7. Cloud Computing – AWS • Tools • Virtual Machines • # of Processors, RAM, OS, disk capacity and I/O – all configurable • Price range: $.02/hr - $4.60/hr • Licensed OSes cost 50% more than Linux OSes • Archive Storage • S3 / Glacier • Work Queues • SQS • Data Stores • Dynamo (key value store), Red Shift (analysis store) • Virtual Networking • Routers, VPN gateways, access control lists, etc • APIs • Command line • HTTPS REST • Native programming languages (Python, bash, PHP, Java etc.) Ideal for rapid prototyping / proof of concepts
  • 8. Cloud Computing – AWS • APIs • Basic • Start an instance (and start billing) • Stop an instance (stop billing) • Insert item into queue • Remove item from queue • Write to backup store • Ultra advanced • Reserved vs. on demand vs. spot instances • Price can drop as much as 80% due to market demand • Instance can disappear at any time
  • 9. Big Data Analytics • Can we skirt the “big data” problem by distilling the tweets down from millions and millions “noise” tweets into a more desirable data set? • Enrich in real time, rather than on archived data, and avoid the overhead of map/reduce? • Possible Enrichment of raw data: • Classification – separate tweets into “relevant” and “irrelevant” • Geocoding – improve on the 1-2% ? • Aggregation –> map reduce • Mapping -> Reduce Function -> Output • AWS – Elastic Map Reduce • Clustering
  • 10. Machine Learning • Classification: relevant, or irrelevant? • Human trained model • Once model is established, bounce new data off it for classification • Validation of model • Accuracy = (Total # of classifications – Mismatches between machine / human) Total # of classifications • Crowdsourcing – AWS Mechanical Turk • Improve model by feeding disagreements back into the model • Our best text classification model to date: low 90%
  • 11. Open Source • Friendly to the commoditized computing paradigm • Don’t have to worry about licensing issues • Contributes to the “throwaway” discipline • Don’t have to re-invent the wheel (collaboration) • Solutions applicable to all parts of the architecture • Acquire data: Node.js – non blocking • Analyze data: R – statistical engine • Store and query data: MongoDB (document store) or Riak (key- value database)
  • 12. Architecture • We know Twitter is providing a mountain of data from all parts of the world • We know Amazon is providing a framework of low cost, on- demand, no commitment computing • Open source is providing a rich tool set • Goals: • Architect with cost in mind! • Enrichment - Real time and after-the-fact enrichment (open data) • Scalable • Decoupled • Service based • Rapid development • Prove the concepts
  • 13. Architecture - Acquire • Acquire the data from Twitter • If classifying in real time: • Store then classify? • Classify then store? • Tools • Twitter streaming API • Keywords • Node.js • Several different packages to interface with Twitter APIs • Amazon • EC2 • SQS (?) Extremely useful, but drives the cost up
  • 14. Architecture - Analyze • Classification interface • Service based – HTTP REST • Push or pull? • Push – classifiers listen on port 80 • Pull – classifier starts pulling from an established work queue • Both highly scalable and flexible with respect to cost. • Stateless • R • Human trained machine learning packages available • Cloud friendly – no licenses • Automatable – from install, configuration, execution
  • 15. Architecture - Store • Store JSON as an object (document store) or normalize (relational database)? • Relational databases • disk I/O intensive – not cloud friendly • allow complex indexing • Easy to get a business intelligence front end on them • Requires a schema / ETL • Key-value document stores • Designed to be scalable – doesn’t need fast disks • Indexing is not nearly as flexible as RDBMS • More difficult to front a UI – no “drag and drop” tools • No schema / ETL needed. • Not as mature • MongoDB / Riak
  • 16. Architecture – Presentation • Least need for cloud friendly scalability here? • Options • Licensed BI software – Tableau, Endeca, Jaspersoft, Pentaho • Open source BI software – SpagoBI • Roll your own - PHP, Ruby, Visual Basic, Javascript, etc • Connect to an existing system instead?
  • 17. Costs – Real Time Classification • Number of tweets collected per day: 1,000,000 (comfortable - .25%) • Machine used on EC2 to acquire (node.js): micro • $.02/hr * 24 hrs = .48/day • Machine used on EC2 to classify (R): small (x2) • $.06/hr * 24 hrs = $1.44/day*2 = $2.88/day • Machine used on EC2 to store (MongoDB): large • $.24/hr * 24 hrs = $5.76 /day • Machine used on EC2 for GUI (Apache): small • $.06/hr * 24 = $1.44 • $0.48+$2.88+$5.76+$1.44 = $10.56 / 1,000,000 = .00001056 cents/tweet Can add more zeros if you relax real-time classification (spot instances)
  • 18. Costs - Archive • Size of average tweet: 2.5 KB • Cost to archive: • s3 : .095 GB/month • 0.0000002 per tweet per month • Glacier: .01 GB/month • 0.00000002 per tweet per month • Compression will add even more zeros, but will require more computing power, and mean more latency for post collection data analysis. Can be automated.
  • 19. Use Cases • Foodborne Chicago (http://foodborne.smartchicagoapps.org/) • Public-private partnership with City of Chicago Dept. of Public Health and Smart Chicago Collaborative • Reach out to city residents on Twitter tweeting about food poisoning symptoms, in an attempt to get them to log information in the City’s 311 database (via the Open311 API) • Once in the 311 database, it follows established City workflows, and becomes actionable • Numbers (1 year): • 2,390 tweets classified as related to food poisoning • 282 tweets responded to • 205 reports submitted • 145 inspections • Real time classification examples: • “Ugh! I got food poisoning from the McDonalds’s on Halstead!” http://184.73.52.31/cgi-bin/R/fp_classifier?text=Ugh!%20I%20got%20food%20poisoning%20from%20McDonalds%20on%20Halstead • “U of Chicago releases a new paper on the effects of food poisoning” http://184.73.52.31/cgi-bin/R/fp_classifier?text=U%20of%20Chicago%20releases%20new%20paper%20on%20the%20effects%20of%20food%20poisoning • Video:http://www.youtube.com/watch?v=RNf9XQ_25Yw&feature=youtu.be
  • 20. Use Cases • Disease Tracker • Large scale attempt to track disease occurrences in the United States. • Sponsored by the Dept. of HHS • Approximately 1 million tweets a day (cold, flu) classified in real time • EC2 scalable instances • Geolocation • Cost to run for 6 months: $850
  • 21. Future Directions • Turnkey service • Can all this functionality be abstracted down to a pushbutton service? • Open data • Can you advertise the data collected, how you enriched it, and allow others to come along an enrich it as well? • General purpose bridge between Twitter and issue tracking databases • Big industry problem
  • 22. Github Sources • Tweet Collector • https://github.com/smartchicago/TweetCollector • Classifier Code • https://github.com/corynissen/foodborne_classifier

×