How to make intelligent web apps

  • 2,473 views
Uploaded on

 

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
2,473
On Slideshare
0
From Embeds
0
Number of Embeds
2

Actions

Shares
Downloads
70
Comments
0
Likes
9

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n
  • \n

Transcript

  • 1. How to make superawesome web apps Deepak Thukral @iapain Europython 2011
  • 2. Computer Science Europython 2011
  • 3. I <3 data and Python Europython 2011
  • 4. Why Python? Europython 2011
  • 5. Because its awesome Europython 2011
  • 6. Awesome web frameworks Europython 2011
  • 7. Awesome web framework in Python Django Cherrypy Turbogears Pylons Zope Your fav web framework Europython 2011
  • 8. Awesome web app formula Europython 2011
  • 9. Awesome Idea +Awesome web framework(s)/tools + Awesome UI = Awesome Web app Europython 2011
  • 10. But It was not so easy 15 years back Europython 2011
  • 11. Awesome website in 1996Source: Wayback machine Europython 2011
  • 12. Two years later ... in 1998 Europython 2011
  • 13. It’s super awesome and still it is Europython 2011
  • 14. Why we need intelligence in web apps? Europython 2011
  • 15. Information Overload Massive Data: Wicked HardSource: Stefano Bussolon (flickr) Europython 2011
  • 16. Information Overload rate~ 130 M websites~ 150 M tweets per day~ 2 M pics on twitter per day~ 4 M blogpost at WordPress per day Europython 2011
  • 17. Why we need intelligence in web apps To find interesting information from data. To do something new in your web apps. To stay in competition in long run. Lots of opportunity. Europython 2011
  • 18. What is Machine Learning (ML)? Europython 2011
  • 19. What is Machine Learning (ML)?Algorithms that allows machines to learn Europython 2011
  • 20. ExamplesSpam detection (Classification)Google News (Clustering, Recommendation)Facebook friend recommendation(Recommendation/Collaborative filtering)Decide price for your products (NearestNeighbor)and so on ... Europython 2011
  • 21. Machine LearningSupervised Unsupervised Classification Clustering ANN, Self regression organized etc maps (SOM) Europython 2011
  • 22. ML modules in PythonOver 20 python modules (MDP, PyBrain,SciKit.learn, NLTK, Orange, LibSVM, Elefantetc)But ... Europython 2011
  • 23. None of them covers all.MDP is pure python.Scikit.learn looks more promising for realword problems. Europython 2011
  • 24. Hardest part is to extract features that these modules can read it. Europython 2011
  • 25. ClusteringSource: NASA (NGC 869) Europython 2011
  • 26. ClusteringHierarchicalk-meansQuality ThresholdFuzzy c-meansLocality sensitive hashing Europython 2011
  • 27. Pick one: k-meansWe have N items and we need to clusterthem into K groups. Europython 2011
  • 28. Randomly select k items as centroids Europython 2011
  • 29. Compute distance between centroid and items and assign them to cluster Europython 2011
  • 30. Adjust the centroids Europython 2011
  • 31. Re-run until it converge Europython 2011
  • 32. Example: News aggregator web app Example: Write a simple web app in Django which aggregate news articles from different sources. Europython 2011
  • 33. Example: News aggregator web app# models.pyclass NewsSource(models.Model): name = models.CharField(max_length=50) url = models.URLField()class News(models.Model) title = models.CharField(max_length=100) content = models.TextField() source = models.ForeignKey(“NewsSource”) cluster = models.ForeignKey(“Cluster”, null=True, blank=True) ...class Cluster(models.Model) created = models.DateTimeField(auto_save_add=True) ... Europython 2011
  • 34. And then an awesome UI - It’s IMPORTANT Europython 2011
  • 35. Let’s make it super awesome byclustering news articles into related stories. Europython 2011
  • 36. Step 1: Finding distance between stories Europython 2011
  • 37. Story 1: Germany eases stance, boosting hope for Greek aidStory 2: Private sector needed in Greek aid deal, Germany says Looks similar, they should be in same cluster Europython 2011
  • 38. Distance(A, B) = 1 - Similarity(A, B) or Distance(A, B) = 1 - n(A ∩ B)/n(A U B)Also known as Jaccard Coefficient, you may also use minHash (it’s quick when you have high dimensionality) Europython 2011
  • 39. Story 1: Germany eases stance, boosting hope for Greek aidStory 2: Private sector needed in Greek aid deal, Germany says Distance(story1, story2) = 1 - 3/12 (ignore stop words)= 0.75 Europython 2011
  • 40. Python code:import refrom django.utils.stopwords import strip_stopwordsdef jaccard_distance(item1, item2): """ A simple distance function (curse of dimensionality applies) """ #Tokenize string into bag of words feature1 = set(re.findall(w+, strip_stopwords("%s %s" % (item1.title.lower(), item1.content.lower())))[:100]) feature2 = set(re.findall(w+, strip_stopwords("%s %s" % (item2.title.lower(), item2.content.lower())))[:100]) similarity = 1.0*len(feature1.intersection(feature2))/len(feature1.union(feature2)) return 1 - similarity Europython 2011
  • 41. BrainstormingHow many clusters are required?What kind of clustering is required? Europython 2011
  • 42. Modify k-meansNot every clusters have to contain items, weneed quality (threshold)It allows us to have arbitrary k and we candiscard empty clusters.It also allows us to maintain quality inclusters.Actually, it’s more similar to CanopyClustering. Europython 2011
  • 43. Python code:class Cluster(object): """ Clustering class """ def __init__(self, items, distance_function=jaccard_distance): self.distance = distance_function self.items = items def kmeans(self, k=10, threshold=0.80): "k is number of clusters, threshold is minimum acceptable distance" #pick k random stories and make then centroid centroids = random.sample(self.items, k) #remove centroid from collection items = list(set(self.items) - set(centroids)) ... Europython 2011
  • 44. Python code:last_matches = None # Max. 50 iterations for convergence. for t in range(50): # Make k empty clusters best_matches = [[] for c in centroids] min_distance = 1 # its max value of distance # Find which centroid is the closest for each row for item in items: best_center = 0 min_distance = 1.0 #max minima = -1 for centroid in centroids: minima+=1 distance = self.distance(item, centroid) if distance <= min_distance: best_center = minima min_distance = distance # maintain quality of your cluster if min_distance <= threshold:#threshold best_matches[best_center].append(item) # If the results are the same as last time, this is complete if best_matches == last_matches: break last_matches = best_matches # Move best_matches to new centroids... Europython 2011
  • 45. Sweet! Europython 2011
  • 46. More examplesHandwriting recognitionImage SegmentationMarker Clustering on MapsCommon patterns in your sale. Europython 2011
  • 47. ClassificationSource: Danny Nicholson (flickr) Europython 2011
  • 48. ClassificationWe present some observations and outcomesto classifier.Then we present an unseen observation andask classifier to predict outcome.Essentially we want to classify a new recordbased on probabilities estimated from thetraining data. Europython 2011
  • 49. Naïve Bayes Classifier P(green) = 40/60 = 2/3 P(red) = 20/60 = 1/3 P(green in vicinity) = 1/40 P(red in vicinity) = 3/20 P(white|green) = 1/40*2/3 = 1/60 P(white|red) = 3/20*1/3 = 4/60 Europython 2011
  • 50. Example: Tweet SentimentExample: Write a simple web app in Djangowhich tells mood of the tweet (sentimentanalysis). Europython 2011
  • 51. Example: Tweet sentiment# models.pyCHOICES = ((‘H’, ‘Happy’), (‘S’, ‘Sad’), (‘N’, ‘Neutral’), )class Tweet(models.Model): author = models.CharField(max_length=50) tweet = models.TextField(max_length=140) sentiment = models.CharField(max_length=1, default=’N’, choices=CHOICES) url = models.URLField() ... Europython 2011
  • 52. Example: Tweet sentiment# classifier.pyfrom classifier import BayesClassifierhappy = [“happy”, “awesome”, “amazing”, “impressed”, “:)”]sad = [“sad”, “sucks”, “problem”, “headache”, “:(“]classifier = BayesClassifier()classifier.train(happy, “happy”)classifier.train(sad, “sad”)classifier.guess(“Europython is awesome this year”) #happyclassifier.guess(“It’s sad I can’t make it to europython this year”) #sad... Europython 2011
  • 53. HappyPack done, tomorrow fly Frankfurt Hahn - Pisa . Ready for a great week to the @europython Europython 2011
  • 54. Sad Bag too small for running shoes. Norunning for me at #europython I guess Europython 2011
  • 55. Enhancements: Tweet sentimentBag of wordsHuman computing Europython 2011
  • 56. Recommendation (Collaborative Filtering) Europython 2011
  • 57. User based or Item based? Europython 2011
  • 58. User based recommendationSimilar User: Relating user with other users.(Pearson correlation)Easy: Quite easy to implement.Slow: For large datasets It’s slow.Make sense: For social sites. Europython 2011
  • 59. Item based recommendationItems for user: Recommend items to a user.Faster: It’s faster when dataset is huge(Amazon, Netflix etc).Make sense: Recommending products. Europython 2011
  • 60. Why item based CF is fast?Users changes very often but items does not. Europython 2011
  • 61. ExamplesLast.fmFacebook, Twitter friend recommendationGoogle Ads, Facebook Ads etc. Europython 2011
  • 62. Scared?scikit-learn (Python)Google Prediction API (Restful)Apache Mahout (Java / JPype)Amazon Mechanical Turk (Human computing) Europython 2011
  • 63. LimitationsAccuracy: Oh la la. Not always accurate.(Locality)Computation: In large dataset it mightrequire lots of computation.Wicked Hard Problems: In some cases it’sjust wickedly hard.And many more... Europython 2011
  • 64. Awesome web app checklistAn awesome ideaAn awesome architecture (scalable)An awesome web framework (probablyDjango)An awesome UIAn awesome Cloud solution (if required) Europython 2011
  • 65. Super awesome checklistAn awesome ideaAn awesome architecture (scalable)An awesome web framework (probably Django)An awesome UIDo interesting thing with you data.An awesome Cloud solution (if required) Europython 2011
  • 66. This books is awesome Europython 2011
  • 67. Other useful techniquesSupport Vector Machines (Classification)Canopy ClusteringLocality sensitive hashingk-dimensional trees Europython 2011
  • 68. You’re Awesome - Q Ahttps://github.com/iapain/smartwebapps Europython 2011