BigData and Algorithms - LA Algorithmic Trading

Slides from LA Algorithmic Trading event (http://www.meetup.com/LA-Algorithmic-Trading/events/98963812/) on using BigData and Algorithms in your business. Covers how What'sGood uses algorithms to allow users to make choices about food on the go and the "BigData" infrastructure we've built to support them.

Includes topics such as "big" data ingestion, in-stream processing, NLP algorithms, assessing "popularity", assigning relevancy weights in search, adding "dimensionality" to restaurant menus by cleansing public data sets, and mapping loosely correlated datasets into your own.

Published in: Technology
Transcript of "BigData and Algorithms - LA Algorithmic Trading"

  1. Algorithmic Trading (In Los Angeles). LA Algorithmic Trading Meetup - Winter 2013
  2. What’s Good?
  3. Welcome. Tim Shea, tim@whatsgood.com, @sheanineseven. Data Scientist. Ad Agency Guy (Razorfish, Universal, TBWA\Chiat\Day). Founder and CTO of WhatsGood.com. Big interest in the convergence of the Tech and Finance communities.
  4. Elevator Pitch. Digital Menu Platform for picky eaters on-the-go. Data-centric POV: Search/Sort/Slice/Dice, answering “What’s Good Here?” The “Good” in WhatsGood varies by person.
  5. “Dimensionality”. Hundreds of data points *behind* each menu item. This data is *hidden* by traditional analog menus. Dimensionality = Personalization.
  6. Thomas Guide. Thomas Guide :: Google Earth as Paper Menu :: What’sGood
  7. Hypothesis
  8. We’re Empiricists
  9. Problem: Noise. 80/20 - In any scenario where you’re ordering food (e.g. at home, in a restaurant), 80% of menu info is noise. Bad in-store. Worse when considering multiple locations. Paper menus don’t help this situation at all.
  10. Result: Human error. Leads to: Frustration - “I’ll just get what I usually get”; Alienation - “I’m going out with my meat-eating friends, I’ll just bring a granola bar”; Accidents - “The waiter didn’t know there was soy sauce in there, and I ended up in the hospital”.
  11. Hypothesis. BigData + Machine Learning + The Crowd will remove these pain points and create something truly valuable for people. Literally improve the way we discover food, permanently.
  12. What’sGood Algos
  13. ClydeStorm
  14. Components: FoodNet + Vegas8 + Rhombus
  15. ClydeStorm. Menu Ingestion - Every 2 weeks, reconcile 400,000 restaurants and 50MM menu items (Add/Edit/Delete). NLP Classifiers - Then, for every dish, we run 8 NLP classifiers to determine (V, G, N, L, P, & Pop). Data Mapping - Orthogonal datasets that “don’t quite fit”. Search - Handles all the modern indexing and retrieval operations consumers are accustomed to.
  16. Vegas8. Based on a simple human intuition: “Signal Words” help us make 1 of 3 determinations: 1. Definitely Positive - “Vegan”: All bets are off, obviously vegan. 2. Strongly Negative - “Ribeye Steak”: Pretty damn confident, not vegan. 3. Fuzzy Signal - Not enough info or conflicting info.
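The three-way determination on the Vegas8 slide can be sketched in a few lines. This is only an illustration of the idea, not the What'sGood implementation: the `Verdict` enum, the `classify_vegan` function, and both signal-word sets are hypothetical names and values invented for the example.

```python
import re
from enum import Enum


class Verdict(Enum):
    POSITIVE = "definitely positive"
    NEGATIVE = "strongly negative"
    FUZZY = "fuzzy signal"


# Hypothetical signal-word lists for a "vegan" classifier (assumed values).
POSITIVE_SIGNALS = {"vegan"}
NEGATIVE_SIGNALS = {"ribeye", "steak", "bacon", "chicken"}


def classify_vegan(text: str) -> Verdict:
    """Return one of the three determinations for a dish title/description."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    has_positive = bool(tokens & POSITIVE_SIGNALS)
    has_negative = bool(tokens & NEGATIVE_SIGNALS)
    if has_positive and not has_negative:
        return Verdict.POSITIVE
    if has_negative and not has_positive:
        return Verdict.NEGATIVE
    # Either no signal words at all, or conflicting ones.
    return Verdict.FUZZY
```

A title such as "Vegan bacon burger" carries both kinds of signal, so this sketch sends it to the fuzzy bucket rather than guessing.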
  17. The Intuition
  18. FoodNet. Based loosely on WordNet, the open source Princeton project. A lexical knowledge graph of word relations (vs a list). E.g. obviously “MILK” is a signal for “Contains Lactose”, but so are all of its other permutations: synonyms; hyper- & hyponyms; other languages; all the foods in the world that commonly use MILK as an ingredient.
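To make the "graph vs a list" point concrete, here is a minimal sketch of expanding one signal word through a relation graph. The `RELATED` dictionary and its entries are made up for illustration; the real FoodNet data and schema are not shown in the slides.

```python
from collections import deque

# Hypothetical miniature graph: each key maps to terms that imply it
# (translations, derived ingredients, dishes that commonly contain it).
RELATED = {
    "milk": ["leche", "cream", "butter", "latte"],
    "cream": ["creme brulee", "alfredo"],
    "butter": ["croissant"],
}


def expand_signal(root: str) -> set[str]:
    """Breadth-first walk collecting every term reachable from `root`."""
    seen = {root}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for neighbour in RELATED.get(term, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen
```

With a flat list, "alfredo" would have to be entered by hand as a lactose signal; with a graph it falls out of the walk from "milk" via "cream".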
  19. First Attempt
  20. First Version. Read from Menu DB - 50MM Venue, Dish Title & Description. Read from Synonym DB - Slam it into a big RegEx. For each record - Any matches? Save results.
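The first version described above might look roughly like the following. It is a sketch under assumptions: the function names, the in-memory records, and the synonym list are all hypothetical stand-ins for the Menu DB and Synonym DB.

```python
import re


def build_pattern(synonyms):
    """Join every synonym into one large alternation ("big RegEx")."""
    # Longest alternatives first so multi-word terms win over shorter ones.
    alternatives = sorted((re.escape(s) for s in synonyms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def match_records(records, pattern):
    """For each (id, text) record, save whichever synonyms matched."""
    return {record_id: pattern.findall(text) for record_id, text in records}


# Hypothetical lactose synonyms and menu records.
lactose = build_pattern(["milk", "cream", "whipped cream"])
results = match_records(
    [
        (1, "Pie with Whipped Cream"),
        (2, "Curry with coconut milk"),
        (3, "Fries"),
    ],
    lactose,
)
```

Record 2 matches "milk" even though coconut milk contains no lactose, which is the kind of unexpected result a raw pattern match tends to produce.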
  21. Results? Mediocre: took forever to run; unexpected results (think: RegEx); tons of edge cases.
  22. Algorithms and NLP
  23. Stepping Back. How do we find better tools for the job? How do we measure any improvements we make? Is there a more “Algorithmic” approach, such as Machine Learning in general, or NLP specifically?
  24. Not #NLP. *Not* Neuro-Linguistic Programming.
  25. What is NLP? Natural Language Processing: an attempt to formalize the ways in which humans understand language into a computer program. Slippery - We’re not accustomed to thinking about how we understand each other, we just do it.
  26. Widely Applicable. Sentiment Analysis - What’s the overall mood here? Text Classification - What is this document I’m reading? Knowledge Mapping - Which things relate to which? Info Extraction - What are the major topics discussed?
  27. What’sGood Use Cases
  28. 1. Similarity
  29. Are these the same? “A Frame, 12565 W Washington Blvd, Culver City, CA, 90066” vs “A-Frame, 12565 Washington Blvd, Los Angeles, CA, 90066”
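One way to approach the "are these the same?" question is to normalize both listings into tokens and score their overlap. The sketch below uses the standard library's `difflib.SequenceMatcher`; the `DIRECTIONALS` drop list and any threshold you would apply to the score are assumptions, not something the slides specify.

```python
import re
from difflib import SequenceMatcher

# Hypothetical drop list: street directionals that listings often omit.
DIRECTIONALS = {"n", "s", "e", "w"}


def normalize(listing: str) -> list[str]:
    """Lowercase, split on punctuation/whitespace, drop directionals."""
    tokens = re.findall(r"[a-z0-9]+", listing.lower())
    return [t for t in tokens if t not in DIRECTIONALS]


def similarity(a: str, b: str) -> float:
    """Token-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

For the two A-Frame listings, the name, street number, street, state, and ZIP all line up and only the city tokens differ, so the score lands well above what two unrelated venues would get.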
  30. This Problem: Creme Brulee vs Crème brûlée vs Cr�e Bru001lee
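The first two spellings can be reconciled with Unicode normalization from the standard library. This is a minimal sketch; `fold` is a hypothetical helper name. The third spelling is already corrupted by a bad decode, so no amount of accent folding will recover it and it still needs fuzzy matching or a fix at ingestion time.

```python
import unicodedata


def fold(text: str) -> str:
    """Strip accents and case so accented and plain spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()
```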
  31. This Problem
  32. Orthogonality. Rhombus - The What’sGood Decoder Ring. Library that attempts to resolve “Matching Problems”. For example: Public Calorie Database - Can I even use it?
  33. TextGrounder. Disambiguate: Georgia vs Georgia. Context: Melrose Heights vs West Hollywood vs Los Angeles.
  34. 2. Sentiment
  35. “Bag of Words”. Typically used with a Naive Bayes Classifier: tokenize; remove stop words; stem the remaining words; frequency distribution - How many times did this occur?
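The four bag-of-words steps can be sketched with the standard library alone. Note the assumptions: the stop-word set is a tiny illustrative sample, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer such as the ones NLTK provides. The sketch stops at the frequency distribution and does not train a classifier.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (a real one is much longer).
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "of", "to", "it"}


def crude_stem(word: str) -> str:
    """Naive suffix stripping; a stand-in for a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text: str) -> Counter:
    tokens = re.findall(r"[a-z']+", text.lower())        # 1. tokenize
    kept = [t for t in tokens if t not in STOP_WORDS]    # 2. remove stop words
    stems = (crude_stem(t) for t in kept)                # 3. stem
    return Counter(stems)                                # 4. frequency distribution
```

Because word order is thrown away, a review that piles up praise and then negates it at the end produces almost the same counts as a glowing one.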
  36. Edge Cases. Yelp Review - Comme Ca: “You’d expect a place with such a diverse selection of French food, wonderfully accommodating staff, and a world class chef to live up to its amazing reputation, but it just simply did not.”
  37. Other Great Tricks: Part of Speech Tagging, N-Grams, Levenshtein Distance, RevMiner
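Two of the tricks listed above are small enough to sketch directly: Levenshtein (edit) distance via the classic dynamic-programming recurrence, and n-gram extraction over a token list. Both functions are generic textbook versions, not code from the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Bigrams keep a phrase like "not good" together as one feature, which a plain bag of words would split into two unrelated tokens.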
  38. Humans!! National Weather Service tries to quantify the effect of humans: precipitation forecasts - 25% lift; temperature forecasts - 10% lift. Traders need human judgement when a model is failing.
  39. 3. Relevancy
  40. Popularity Algorithm: “Social Triangulation”. (A * (# star ratings) + B * (# of dish mentions / total reviews at restaurant) + C * (# of photos / avg mentions per restaurant in specific geography)) * Arbitrary population weight
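The popularity formula translates almost term for term into a function. The parameter names below are my own labels for the quantities on the slide, the default weights of 1.0 are placeholders, and returning 0.0 for a zero denominator is an assumption the slide does not address.

```python
def popularity(
    star_ratings: float,
    dish_mentions: float,
    total_reviews: float,
    photos: float,
    avg_mentions_in_area: float,
    a: float = 1.0,
    b: float = 1.0,
    c: float = 1.0,
    population_weight: float = 1.0,
) -> float:
    """(A*stars + B*mentions/reviews + C*photos/avg_mentions) * population weight."""
    # Guard against empty restaurants or geographies (assumed behaviour).
    mention_rate = dish_mentions / total_reviews if total_reviews else 0.0
    photo_rate = photos / avg_mentions_in_area if avg_mentions_in_area else 0.0
    return (a * star_ratings + b * mention_rate + c * photo_rate) * population_weight
```

Dividing mentions by total reviews and photos by the area average keeps a busy restaurant in a dense neighborhood from dominating purely on volume.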
  41. Search Weights. Which signals are more important: Number of times your search query matched something? Your previous searches & behaviors? Does proximity to you outweigh other factors? Does popularity?
  42. Infrastructure
  43. Stack. Running on Windows. Web/REST tier in the cloud. Dedicated RDBMS on solid state drives. C# & SQL Server. Python & NLTK. Solr/Lucene.
  44. Results: “Vegas8 - RegEx 1”. Raw RegEx, RackSpace Cloud, shared CPU, 5 classifiers, ~1 record/sec. 50MM records = 50MM seconds = ~14,000 hours = ~578 days.
  45. Results (50MM records)
      Run                                        Records/Sec   Total Sec    Total Hours   Total Days
      RegEx 1                                    1             50,000,000   13,888.89     578.70
      Tokenize 1                                 25            2,000,000    555.56        23.15
      Tokenize 2 (SSD & dedicated CPU)           110           454,545      126.26        5.26
      Tokenize 3 with 50MM caching               16            3,125,000    868.06        36.17
      Tokenize 4 with 10K caching                225           222,222      61.73         2.57
      Token/Stem/Stop                            230           217,391      60.39         2.52
      Token/Stem/Stop w/ 4 parallel processes    874           57,208       15.89         0.66
      Levenshtein/Weights/Biz Rules/Hadoop       ??            ??           ??            ??
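The totals in the results table follow from one division, provided the rate column is read as records per second (which is what the totals imply: 50MM records at 1 per second is 50MM seconds). A small sketch of that arithmetic, with `run_time` as a hypothetical helper name:

```python
TOTAL_RECORDS = 50_000_000  # menu items per ingestion pass, from the slides


def run_time(records_per_sec: float, total_records: int = TOTAL_RECORDS):
    """Return (seconds, hours, days) needed to process every record once."""
    seconds = total_records / records_per_sec
    hours = seconds / 3600
    days = hours / 24
    return seconds, hours, days
```

Since ingestion repeats every 2 weeks, a run only becomes practical once its total drops below 14 days, which the table first reaches with SSDs and a dedicated CPU.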
  46. Improvements? Serialization - eats ~40-60% of overhead. How do we remove it? Dedicated Hardware - SSDs & dedicated CPU. Parallelization - Hadoop? RightScale? Custom solution? Indexing - SQL as “dumb” storage. Solr for search.
  47. Experts. Panel of Resident Nutritionists, formalizing things like: “What is Hangover Food”, “How to get Huge fast”, “How to be a really annoying Yogi”.
  48. Final Thoughts
  49. Trading Parallels. Dynamic vs Static Systems. Knowledge/Signal Graph: if you’re monitoring “Apple” you’ll need to monitor Apple, $AAPL, Tim Cook, iPhone, FOXCONN, and assign a signal weight and signal vector for each. Orthogonality: using loosely correlative systems.
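As a rough illustration of the knowledge/signal graph idea for a trading watchlist, the sketch below attaches a weight and a direction to each related term and scores a headline against them. Everything specific here is an assumption: the `Signal` class, the weight values, and the reading of "signal vector" as a simple +1/-1 direction are mine, not the speaker's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    term: str
    weight: float   # how much a mention matters, 0..1 (made-up values)
    direction: int  # assumed "signal vector": +1 moves with the target, -1 against


# Hypothetical watchlist entry built from the terms named on the slide.
WATCHLIST = {
    "Apple": [
        Signal("Apple", 1.0, +1),
        Signal("$AAPL", 1.0, +1),
        Signal("Tim Cook", 0.6, +1),
        Signal("iPhone", 0.8, +1),
        Signal("FOXCONN", 0.4, +1),
    ]
}


def score_headline(target: str, headline: str) -> float:
    """Sum weight * direction for every related term the headline mentions."""
    text = headline.lower()
    return sum(
        s.weight * s.direction
        for s in WATCHLIST[target]
        if s.term.lower() in text
    )
```

A headline about Foxconn and the iPhone scores for "Apple" here without ever naming the company, which is the point of monitoring a graph rather than a single keyword.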
  50. Data Science. Burgeoning skill set: Data (programmer, sys admin, full stack knowledge); Stats (probability, algorithms, empirical methodology); Business (“real world” knowledge, subjectivity, modeling uncertainty).
  51. Resources: Data Science Toolkit, NLTK, Nate Silver - The Signal and the Noise
  52. Tim Shea, @sheanineseven, tim@whatsgood.com