How to make super
awesome web apps
     Deepak Thukral
        @iapain




                      Europython 2011
Computer Science




I <3 data
   and
 Python




Why Python?




Because it's awesome




Awesome web frameworks




Awesome web frameworks in Python
   Django

   CherryPy

   TurboGears

   Pylons

   Zope

   Your fav web framework

Awesome web app formula




Awesome Idea
             +
Awesome web framework(s)/tools
             +
        Awesome UI
             =
   Awesome Web app

But it was not so easy 15 years ago




Awesome website in 1996




Source: Wayback machine
Two years later ... in 1998




It was super awesome, and it still is




Why do we need intelligence in web apps?




Information Overload




                                   Massive Data: Wicked Hard
Source: Stefano Bussolon (flickr)
Information Overload rate



~ 130 M websites

~ 150 M tweets per day

~ 2 M pics on Twitter per day

~ 4 M blog posts on WordPress per day



Why we need intelligence in web apps


    To find interesting information in data.

    To do something new in your web apps.

    To stay competitive in the long run.

    Lots of opportunity.




What is Machine Learning (ML)?




What is Machine Learning (ML)?


Algorithms that allow machines to learn




Examples

Spam detection (Classification)

Google News (Clustering, Recommendation)

Facebook friend recommendation
(Recommendation/Collaborative filtering)

Decide price for your products (Nearest
Neighbor)

and so on ...


Machine Learning



Supervised                 Unsupervised

  Classification             Clustering

  ANN, regression, etc.      Self-organizing maps (SOM)




ML modules in Python



Over 20 Python modules (MDP, PyBrain, scikit-learn,
NLTK, Orange, LibSVM, Elefant, etc.)

But ...




None of them covers everything.

MDP is pure Python.

scikit-learn looks more promising for real-world
problems.




The hardest part is extracting features that
       these modules can read.




Clustering




Source: NASA (NGC 869)
Clustering

Hierarchical

k-means

Quality Threshold

Fuzzy c-means

Locality sensitive hashing




Pick one: k-means




We have N items and we need to cluster
them into K groups.




Randomly select k items as centroids

Compute the distance between each item and each centroid,
   and assign each item to the nearest cluster
Adjust the centroids

Repeat until it converges
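
The four steps above can be sketched in plain Python. This is a minimal illustration on 2-D points, not code from the talk: the function names `squared_distance`, `mean_point` and `kmeans`, the `max_iter` and `seed` parameters, and the rule that an empty cluster keeps its old centroid are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of 2-D points."""
    return (sum(x for x, _ in points) / len(points),
            sum(y for _, y in points) / len(points))


def kmeans(points, k, max_iter=100, seed=None):
    """Return (centroids, clusters) for a list of 2-D points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # randomly select k items as centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assign each item to its nearest centroid
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # adjust the centroids; an empty cluster keeps its old centroid (assumption)
        new_centroids = [mean_point(c) if c else old
                         for c, old in zip(clusters, centroids)]
        if new_centroids == centroids:       # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters
```

Calling `kmeans(points, k=2)` on two well-separated groups of points should return one cluster per group; on messier data the result depends on the random starting centroids.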

Example: News aggregator web app




  Example: Write a simple web app in Django
  which aggregates news articles from different
  sources.




Example: News aggregator web app
# models.py
from django.db import models


class NewsSource(models.Model):
    name = models.CharField(max_length=50)
    url = models.URLField()


class News(models.Model):
    title = models.CharField(max_length=100)
    content = models.TextField()
    # on_delete is required by Django 2.0 and later
    source = models.ForeignKey("NewsSource", on_delete=models.CASCADE)
    cluster = models.ForeignKey("Cluster", null=True, blank=True,
                                on_delete=models.SET_NULL)
    ...


class Cluster(models.Model):
    # auto_now_add stamps the row once, when it is first saved
    created = models.DateTimeField(auto_now_add=True)
    ...




And then an awesome UI - It’s
        IMPORTANT




Let’s make it super awesome by
clustering news articles into related
               stories.




Step 1: Finding distance between stories




Story 1: Germany eases stance, boosting hope for
                  Greek aid


Story 2: Private sector needed in Greek aid deal,
                  Germany says


     They look similar, so they should be in the same cluster


Distance(A, B) = 1 - Similarity(A, B)

                      or

     Distance(A, B) = 1 - n(A ∩ B)/n(A ∪ B)


The ratio n(A ∩ B)/n(A ∪ B) is the Jaccard coefficient, so this
   is the Jaccard distance. You may also use MinHash (it's quick
            when you have high dimensionality)

Story 1: Germany eases stance, boosting hope for
                  Greek aid


Story 2: Private sector needed in Greek aid deal,
                  Germany says

 Distance(story1, story2) = 1 - 3/12 = 0.75
              (ignoring stop words)
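
The 3/12 can be checked by hand or with a few lines of Python. The sketch below is only an illustration: `STOP_WORDS` is a deliberately tiny, made-up list, and `tokens` and `set_distance` are hypothetical helper names, not part of the talk's code.

```python
import re

# Hypothetical, deliberately tiny stop-word list for this example only.
STOP_WORDS = {"for", "in", "the", "a", "an", "of", "to"}


def tokens(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS}


def set_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two token sets."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


story1 = tokens("Germany eases stance, boosting hope for Greek aid")
story2 = tokens("Private sector needed in Greek aid deal, Germany says")

print(sorted(story1 & story2))       # ['aid', 'germany', 'greek']
print(len(story1 | story2))          # 12
print(set_distance(story1, story2))  # 0.75
```

The two headlines share three words out of twelve distinct ones, which is where the distance of 0.75 comes from.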




Python code:


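import re
# Note: strip_stopwords shipped with Django when this talk was given; newer
# Django releases may not include it, so you may need your own stop-word filter.
from django.utils.stopwords import strip_stopwords

def jaccard_distance(item1, item2):
    """
    A simple distance function (curse of dimensionality applies)
    """
    # Tokenize each story into a bag of words (first 100 tokens only)
    feature1 = set(re.findall(r'\w+', strip_stopwords("%s %s" %
                          (item1.title.lower(), item1.content.lower())))[:100])
    feature2 = set(re.findall(r'\w+', strip_stopwords("%s %s" %
                          (item2.title.lower(), item2.content.lower())))[:100])

    union = feature1.union(feature2)
    if not union:
        return 0.0  # both stories are empty, so treat them as identical

    similarity = 1.0 * len(feature1.intersection(feature2)) / len(union)

    return 1 - similarity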




Brainstorming




How many clusters are required?

What kind of clustering is required?




Modify k-means

Not every cluster has to contain items; we
need quality (a threshold).

It allows us to pick an arbitrary k and
discard empty clusters.

It also allows us to maintain quality within
clusters.

Actually, it's more similar to Canopy
Clustering.


Python code:


import random


class Cluster(object):
    """
    Clustering class
    """
    def __init__(self, items, distance_function=jaccard_distance):
        self.distance = distance_function
        self.items = items

    def kmeans(self, k=10, threshold=0.80):
        "k is number of clusters, threshold is maximum acceptable distance"

        # pick k random stories and make them centroids
        centroids = random.sample(self.items, k)

        # remove centroids from collection
        items = list(set(self.items) - set(centroids))

        ...




Python code:

        last_matches = None
        # Max. 50 iterations for convergence.
        for t in range(50):
            # Make k empty clusters
            best_matches = [[] for c in centroids]

            # Find which centroid is the closest for each item
            for item in items:
                best_center = 0
                min_distance = 1.0  # maximum possible distance
                for index, centroid in enumerate(centroids):
                    distance = self.distance(item, centroid)
                    if distance <= min_distance:
                        best_center = index
                        min_distance = distance
                # maintain quality of your cluster
                if min_distance <= threshold:
                    best_matches[best_center].append(item)

            # If the results are the same as last time, this is complete
            if best_matches == last_matches:
                break
            last_matches = best_matches
            # Move best_matches to new centroids...
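
The slide stops before the centroid update. One possible way to close the loop is sketched below, as a standalone function over sets of tokens. It makes several assumptions that are not in the talk: each non-empty cluster is re-centred on its medoid (the member with the smallest total distance to the others), centroids stay in the item list instead of being removed, and the names `jaccard`, `medoid`, `threshold_kmeans`, `max_iter` and `seed` are invented for the example.

```python
import random


def jaccard(a, b):
    """Jaccard distance between two sets of tokens."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def medoid(cluster, distance):
    """Member with the smallest total distance to the rest of its cluster."""
    return min(cluster, key=lambda c: sum(distance(c, other) for other in cluster))


def threshold_kmeans(items, k, threshold=0.8, distance=jaccard, max_iter=50, seed=None):
    """k-means variant that drops items farther than `threshold` from every centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(items, k)
    last_matches = None
    best_matches = []
    for _ in range(max_iter):
        best_matches = [[] for _ in centroids]
        for item in items:
            distances = [distance(item, c) for c in centroids]
            best = min(range(len(centroids)), key=distances.__getitem__)
            if distances[best] <= threshold:   # quality gate from the slide
                best_matches[best].append(item)
        if best_matches == last_matches:       # same assignment twice: converged
            break
        last_matches = best_matches
        # Assumed update rule: re-centre each non-empty cluster on its medoid.
        centroids = [medoid(cluster, distance) if cluster else old
                     for cluster, old in zip(best_matches, centroids)]
    return [cluster for cluster in best_matches if cluster]  # discard empty clusters
```

A medoid is used here because a set of words has no natural "average"; with numeric feature vectors the usual mean would do. Items that are not within the threshold of any centroid are simply left unclustered, so the result depends on which stories are picked as starting centroids.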


Sweet!




More examples



Handwriting recognition

Image Segmentation

Marker Clustering on Maps

Common patterns in your sales.




Classification




Source: Danny Nicholson (flickr)
Classification


We present some observations and their outcomes
to the classifier.

Then we present an unseen observation and
ask the classifier to predict its outcome.

Essentially, we want to classify a new record
based on probabilities estimated from the
training data.


Naïve Bayes Classifier
                P(green) = 40/60 = 2/3
                P(red) = 20/60 = 1/3




              P(in vicinity of white | green) = 1/40
              P(in vicinity of white | red) = 3/20




   P(green | white) ∝ 1/40*2/3 = 1/60
   P(red | white) ∝ 3/20*1/3 = 3/60

   The new (white) point is classified as red.
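
The arithmetic on this slide can be reproduced with exact fractions. The sketch below only restates the slide's counts (40 green, 20 red, 1 green and 3 red near the new point); the variable names are illustrative.

```python
from fractions import Fraction

total = {"green": 40, "red": 20}   # objects of each colour
nearby = {"green": 1, "red": 3}    # objects of each colour near the new point

n = sum(total.values())
scores = {}
for colour in total:
    prior = Fraction(total[colour], n)                    # P(colour)
    likelihood = Fraction(nearby[colour], total[colour])  # P(in vicinity | colour)
    scores[colour] = prior * likelihood                   # unnormalised posterior

evidence = sum(scores.values())
posterior = {colour: score / evidence for colour, score in scores.items()}

print(scores)      # green is 1/60, red is 1/20 (that is, 3/60)
print(posterior)   # green is 1/4, red is 3/4
print(max(posterior, key=posterior.get))  # red
```

Dividing by the evidence turns the two scores into probabilities that sum to 1, but the winner is the same either way.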


Example: Tweet Sentiment




Example: Write a simple web app in Django
which tells the mood of a tweet (sentiment
analysis).




Example: Tweet sentiment


# models.py
from django.db import models

CHOICES = (('H', 'Happy'),
           ('S', 'Sad'),
           ('N', 'Neutral'),
           )

class Tweet(models.Model):
    author = models.CharField(max_length=50)
    tweet = models.TextField(max_length=140)
    sentiment = models.CharField(max_length=1, default='N', choices=CHOICES)
    url = models.URLField()
    ...




Example: Tweet sentiment


# classifier.py
from classifier import BayesClassifier

happy = ["happy", "awesome", "amazing", "impressed", ":)"]
sad = ["sad", "sucks", "problem", "headache", ":("]

classifier = BayesClassifier()
classifier.train(happy, "happy")
classifier.train(sad, "sad")

classifier.guess("Europython is awesome this year") #happy
classifier.guess("It's sad I can't make it to europython this year") #sad
...
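
The slide does not show what `BayesClassifier` does internally. As a rough idea of what a `train`/`guess` pair could look like, here is a toy multinomial naive Bayes with add-one smoothing. `TinyBayes` and its tokenizer are assumptions made for this sketch and are not the class used in the talk.

```python
import math
import re
from collections import Counter, defaultdict


class TinyBayes:
    """Toy multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> training examples seen

    @staticmethod
    def tokenize(text):
        # Words plus the two emoticons used in the slide's training lists.
        return re.findall(r"[a-z']+|:\)|:\(", text.lower())

    def train(self, examples, label):
        for example in examples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(self.tokenize(example))

    def guess(self, text):
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            for token in self.tokenize(text):
                score += math.log((counts[token] + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


classifier = TinyBayes()
classifier.train(["happy", "awesome", "amazing", "impressed", ":)"], "happy")
classifier.train(["sad", "sucks", "problem", "headache", ":("], "sad")

print(classifier.guess("Europython is awesome this year"))                   # happy
print(classifier.guess("It's sad I can't make it to europython this year"))  # sad
```

With only five training words per mood, a tweet that contains none of them scores the same for both labels, so a real app needs far more training data.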




Happy



Pack done, tomorrow fly Frankfurt Hahn -
  Pisa . Ready for a great week to the
              @europython




Sad




  Bag too small for running shoes. No
running for me at #europython I guess




Enhancements: Tweet sentiment



Bag of words

Human computing




Recommendation (Collaborative
        Filtering)




User based or Item based?




User based recommendation


Similar users: Relate a user to other users
(Pearson correlation).

Easy: Quite easy to implement.

Slow: It's slow for large datasets.

Makes sense: For social sites.
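
A minimal sketch of the user-based idea, assuming each user's ratings are stored as a dict from item to score. The function names and the choice to return 0.0 when two users share fewer than two items are assumptions for this example.

```python
from math import sqrt


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    shared = sorted(set(ratings_a) & set(ratings_b))
    n = len(shared)
    if n < 2:
        return 0.0  # not enough overlap to say anything
    xs = [ratings_a[item] for item in shared]
    ys = [ratings_b[item] for item in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user who rates everything the same carries no signal
    return cov / sqrt(var_x * var_y)


def most_similar(user, all_ratings):
    """Name of the other user whose ratings correlate best with `user`."""
    others = (name for name in all_ratings if name != user)
    return max(others, key=lambda name: pearson(all_ratings[user], all_ratings[name]))
```

`most_similar` compares one user against every other user, which is the part that gets slow as the user base grows.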




Item based recommendation



Items for user: Recommend items to a user.

Faster: It's faster when the dataset is huge
(Amazon, Netflix, etc.).

Makes sense: Recommending products.




Why is item-based CF fast?


Users change very often, but items do not.
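
Because the item catalogue is comparatively stable, an item-to-item similarity table can be built once (for example in a periodic batch job) and then reused for every request. The sketch below illustrates that idea with Jaccard similarity over the sets of users who liked each item; the data layout and the names `build_item_similarity` and `recommend` are assumptions, not the talk's implementation.

```python
from collections import defaultdict
from itertools import combinations


def build_item_similarity(likes):
    """likes maps user -> set of liked items; returns item -> {other item: similarity}."""
    fans = defaultdict(set)                  # item -> users who liked it
    for user, items in likes.items():
        for item in items:
            fans[item].add(user)
    table = defaultdict(dict)
    for a, b in combinations(sorted(fans), 2):
        overlap = len(fans[a] & fans[b])
        if overlap:
            score = overlap / len(fans[a] | fans[b])
            table[a][b] = table[b][a] = score
    return table


def recommend(user, likes, table, top_n=3):
    """Score unseen items by their summed similarity to the user's liked items."""
    scores = defaultdict(float)
    for liked in likes[user]:
        for other, score in table.get(liked, {}).items():
            if other not in likes[user]:
                scores[other] += score
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]
```

At request time `recommend` only performs lookups in the precomputed table, so its cost depends on how many items one user liked rather than on the total number of users.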




Examples




Last.fm

Facebook, Twitter friend recommendation

Google Ads, Facebook Ads etc.




Scared?



scikit-learn (Python)

Google Prediction API (RESTful)

Apache Mahout (Java / JPype)

Amazon Mechanical Turk (Human computing)



Limitations


Accuracy: Oh la la. Not always accurate.
(Locality)

Computation: On large datasets it may
require a lot of computation.

Wicked Hard Problems: In some cases it's
just wickedly hard.

And many more...


Awesome web app checklist


An awesome idea

An awesome architecture (scalable)

An awesome web framework (probably
Django)

An awesome UI

An awesome Cloud solution (if required)


Super awesome checklist


An awesome idea

An awesome architecture (scalable)

An awesome web framework (probably Django)

An awesome UI


Do interesting things with your data.
An awesome Cloud solution (if required)




This book is awesome




Other useful techniques



Support Vector Machines (Classification)

Canopy Clustering

Locality sensitive hashing

k-dimensional trees



You're Awesome - Q & A




https://github.com/iapain/smartwebapps


More Related Content

Viewers also liked

Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)
Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)
Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)John Collins
 
mobile development platforms
mobile development platformsmobile development platforms
mobile development platformsguestfa9375
 
My Valentine Gift - YOU Decide
My Valentine Gift - YOU DecideMy Valentine Gift - YOU Decide
My Valentine Gift - YOU DecideSizzlynRose
 
Stc 2014 unraveling the mysteries of localization kits
Stc 2014 unraveling the mysteries of localization kitsStc 2014 unraveling the mysteries of localization kits
Stc 2014 unraveling the mysteries of localization kitsDavid Sommer
 
Designing for Multiple Mobile Platforms
Designing for Multiple Mobile PlatformsDesigning for Multiple Mobile Platforms
Designing for Multiple Mobile PlatformsRobert Douglas
 
Sample of instructions
Sample of instructionsSample of instructions
Sample of instructionsDavid Sommer
 
Strategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful LocalizationStrategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful LocalizationJohn Collins
 
Putting Out Fires with Content Strategy (STC Academic SIG)
Putting Out Fires with Content Strategy (STC Academic SIG)Putting Out Fires with Content Strategy (STC Academic SIG)
Putting Out Fires with Content Strategy (STC Academic SIG)John Collins
 
Building Quality Experiences for Users in Any Language
Building Quality Experiences for Users in Any LanguageBuilding Quality Experiences for Users in Any Language
Building Quality Experiences for Users in Any LanguageJohn Collins
 
The ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj KumarThe ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj KumarThoughtWorks
 
2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentary2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentaryalghanim
 
Linguistic Potluck: Crowdsourcing localization with Rails
Linguistic Potluck: Crowdsourcing localization with RailsLinguistic Potluck: Crowdsourcing localization with Rails
Linguistic Potluck: Crowdsourcing localization with RailsHeatherRivers
 
Open Software Platforms for Mobile Digital Broadcasting
Open Software Platforms for Mobile Digital BroadcastingOpen Software Platforms for Mobile Digital Broadcasting
Open Software Platforms for Mobile Digital BroadcastingFrancois Lefebvre
 
Sample email submission
Sample email submissionSample email submission
Sample email submissionDavid Sommer
 
My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3David Sommer
 
Pycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from JavaPycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from Javajbellis
 
Bank Account Of Life
Bank Account Of LifeBank Account Of Life
Bank Account Of LifeNafass
 

Viewers also liked (20)

Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)
Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)
Strategies for Friendly English and Successful Localization (InfoDevWorld 2014)
 
Glossary
GlossaryGlossary
Glossary
 
mobile development platforms
mobile development platformsmobile development platforms
mobile development platforms
 
Shrunken Head
 Shrunken Head  Shrunken Head
Shrunken Head
 
My Valentine Gift - YOU Decide
My Valentine Gift - YOU DecideMy Valentine Gift - YOU Decide
My Valentine Gift - YOU Decide
 
Stc 2014 unraveling the mysteries of localization kits
Stc 2014 unraveling the mysteries of localization kitsStc 2014 unraveling the mysteries of localization kits
Stc 2014 unraveling the mysteries of localization kits
 
Designing for Multiple Mobile Platforms
Designing for Multiple Mobile PlatformsDesigning for Multiple Mobile Platforms
Designing for Multiple Mobile Platforms
 
Sample of instructions
Sample of instructionsSample of instructions
Sample of instructions
 
Strategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful LocalizationStrategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful Localization
 
Putting Out Fires with Content Strategy (STC Academic SIG)
Putting Out Fires with Content Strategy (STC Academic SIG)Putting Out Fires with Content Strategy (STC Academic SIG)
Putting Out Fires with Content Strategy (STC Academic SIG)
 
Silmeyiniz
SilmeyinizSilmeyiniz
Silmeyiniz
 
Building Quality Experiences for Users in Any Language
Building Quality Experiences for Users in Any LanguageBuilding Quality Experiences for Users in Any Language
Building Quality Experiences for Users in Any Language
 
The ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj KumarThe ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj Kumar
 
2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentary2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentary
 
Linguistic Potluck: Crowdsourcing localization with Rails
Linguistic Potluck: Crowdsourcing localization with RailsLinguistic Potluck: Crowdsourcing localization with Rails
Linguistic Potluck: Crowdsourcing localization with Rails
 
Open Software Platforms for Mobile Digital Broadcasting
Open Software Platforms for Mobile Digital BroadcastingOpen Software Platforms for Mobile Digital Broadcasting
Open Software Platforms for Mobile Digital Broadcasting
 
Sample email submission
Sample email submissionSample email submission
Sample email submission
 
My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3
 
Pycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from JavaPycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from Java
 
Bank Account Of Life
Bank Account Of LifeBank Account Of Life
Bank Account Of Life
 

Similar to How to make intelligent web apps

Transfer learning, active learning using tensorflow object detection api
Transfer learning, active learning  using tensorflow object detection apiTransfer learning, active learning  using tensorflow object detection api
Transfer learning, active learning using tensorflow object detection api설기 김
 
[E-Dev-Day 2014][4/16] Review of Eolian, Eo, Bindings, Interfaces and What's ...
[E-Dev-Day 2014][4/16] Review of Eolian, Eo, Bindings, Interfaces and What's ...[E-Dev-Day 2014][4/16] Review of Eolian, Eo, Bindings, Interfaces and What's ...
[E-Dev-Day 2014][4/16] Review of Eolian, Eo, Bindings, Interfaces and What's ...EnlightenmentProject
 
2019 05 11 Chicago Codecamp - Deep Learning for everyone? Challenge Accepted!
2019 05 11 Chicago Codecamp - Deep Learning for everyone? Challenge Accepted!2019 05 11 Chicago Codecamp - Deep Learning for everyone? Challenge Accepted!
2019 05 11 Chicago Codecamp - Deep Learning for everyone? Challenge Accepted!Bruno Capuano
 
These questions will be a bit advanced level 2
These questions will be a bit advanced level 2These questions will be a bit advanced level 2
These questions will be a bit advanced level 2sadhana312471
 
D3, TypeScript, and Deep Learning
D3, TypeScript, and Deep LearningD3, TypeScript, and Deep Learning
D3, TypeScript, and Deep LearningOswald Campesato
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationTravis Oliphant
 
D3, TypeScript, and Deep Learning
D3, TypeScript, and Deep LearningD3, TypeScript, and Deep Learning
D3, TypeScript, and Deep LearningOswald Campesato
 
Nltk - Boston Text Analytics
Nltk - Boston Text AnalyticsNltk - Boston Text Analytics
Nltk - Boston Text Analyticsshanbady
 
Convolutional Neural Network (CNN) - image recognition
Convolutional Neural Network (CNN)  - image recognitionConvolutional Neural Network (CNN)  - image recognition
Convolutional Neural Network (CNN) - image recognitionYUNG-KUEI CHEN
 
Bootstrap Custom Image Classification using Transfer Learning by Danielle Dea...
Bootstrap Custom Image Classification using Transfer Learning by Danielle Dea...Bootstrap Custom Image Classification using Transfer Learning by Danielle Dea...
Bootstrap Custom Image Classification using Transfer Learning by Danielle Dea...Wee Hyong Tok
 
Devoxx traitement automatique du langage sur du texte en 2019
Devoxx   traitement automatique du langage sur du texte en 2019 Devoxx   traitement automatique du langage sur du texte en 2019
Devoxx traitement automatique du langage sur du texte en 2019 Alexis Agahi
 
Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)Oswald Campesato
 
Parameterizing and Assembling IR-based Solutions for SE Tasks using Genetic A...
Parameterizing and Assembling IR-based Solutions for SE Tasks using Genetic A...Parameterizing and Assembling IR-based Solutions for SE Tasks using Genetic A...
Parameterizing and Assembling IR-based Solutions for SE Tasks using Genetic A...Annibale Panichella
 
Mining Interesting Trivia for Entities from Wikipedia PART-II
Mining Interesting Trivia for Entities from Wikipedia PART-IIMining Interesting Trivia for Entities from Wikipedia PART-II
Mining Interesting Trivia for Entities from Wikipedia PART-IIAbhay Prakash
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupAndy Sloane
 
Domain specific languages and Scala
Domain specific languages and ScalaDomain specific languages and Scala
Domain specific languages and ScalaFilip Krikava
 

Similar to How to make intelligent web apps (20)

Transfer learning, active learning using tensorflow object detection api
Transfer learning, active learning  using tensorflow object detection apiTransfer learning, active learning  using tensorflow object detection api
Transfer learning, active learning using tensorflow object detection api
 
[E-Dev-Day 2014][4/16] Review of Eolian, Eo, Bindings, Interfaces and What's ...
[E-Dev-Day 2014][4/16] Review of Eolian, Eo, Bindings, Interfaces and What's ...[E-Dev-Day 2014][4/16] Review of Eolian, Eo, Bindings, Interfaces and What's ...
[E-Dev-Day 2014][4/16] Review of Eolian, Eo, Bindings, Interfaces and What's ...
 
2019 05 11 Chicago Codecamp - Deep Learning for everyone? Challenge Accepted!
2019 05 11 Chicago Codecamp - Deep Learning for everyone? Challenge Accepted!2019 05 11 Chicago Codecamp - Deep Learning for everyone? Challenge Accepted!
2019 05 11 Chicago Codecamp - Deep Learning for everyone? Challenge Accepted!
 
These questions will be a bit advanced level 2
These questions will be a bit advanced level 2These questions will be a bit advanced level 2
These questions will be a bit advanced level 2
 
D3, TypeScript, and Deep Learning
D3, TypeScript, and Deep LearningD3, TypeScript, and Deep Learning
D3, TypeScript, and Deep Learning
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft Presentation
 
D3, TypeScript, and Deep Learning
D3, TypeScript, and Deep LearningD3, TypeScript, and Deep Learning
D3, TypeScript, and Deep Learning
 
Distributed deep learning_over_spark_20_nov_2014_ver_2.8
Distributed deep learning_over_spark_20_nov_2014_ver_2.8Distributed deep learning_over_spark_20_nov_2014_ver_2.8
Distributed deep learning_over_spark_20_nov_2014_ver_2.8
 
Nltk - Boston Text Analytics
Nltk - Boston Text AnalyticsNltk - Boston Text Analytics
Nltk - Boston Text Analytics
 
Convolutional Neural Network (CNN) - image recognition
Convolutional Neural Network (CNN)  - image recognitionConvolutional Neural Network (CNN)  - image recognition
Convolutional Neural Network (CNN) - image recognition
 
Java
JavaJava
Java
 
Bootstrap Custom Image Classification using Transfer Learning by Danielle Dea...
Bootstrap Custom Image Classification using Transfer Learning by Danielle Dea...Bootstrap Custom Image Classification using Transfer Learning by Danielle Dea...
Bootstrap Custom Image Classification using Transfer Learning by Danielle Dea...
 
Devoxx traitement automatique du langage sur du texte en 2019
Devoxx   traitement automatique du langage sur du texte en 2019 Devoxx   traitement automatique du langage sur du texte en 2019
Devoxx traitement automatique du langage sur du texte en 2019
 
Python for dummies
Python for dummiesPython for dummies
Python for dummies
 
Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)
 
Object
ObjectObject
Object
 
Parameterizing and Assembling IR-based Solutions for SE Tasks using Genetic A...
Parameterizing and Assembling IR-based Solutions for SE Tasks using Genetic A...Parameterizing and Assembling IR-based Solutions for SE Tasks using Genetic A...
Parameterizing and Assembling IR-based Solutions for SE Tasks using Genetic A...
 
Mining Interesting Trivia for Entities from Wikipedia PART-II
Mining Interesting Trivia for Entities from Wikipedia PART-IIMining Interesting Trivia for Entities from Wikipedia PART-II
Mining Interesting Trivia for Entities from Wikipedia PART-II
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data Meetup
 
Domain specific languages and Scala
Domain specific languages and ScalaDomain specific languages and Scala
Domain specific languages and Scala
 

How to make intelligent web apps

  • 1. How to make super awesome web apps Deepak Thukral @iapain Europython 2011
  • 2. Computer Science Europython 2011
  • 3. I <3 data and Python Europython 2011
  • 4. Why Python? Europython 2011
  • 5. Because its awesome Europython 2011
  • 6. Awesome web frameworks Europython 2011
  • 7. Awesome web framework in Python Django Cherrypy Turbogears Pylons Zope Your fav web framework Europython 2011
  • 8. Awesome web app formula Europython 2011
  • 9. Awesome Idea + Awesome web framework(s)/tools + Awesome UI = Awesome Web app Europython 2011
  • 10. But It was not so easy 15 years back Europython 2011
  • 11. Awesome website in 1996 Source: Wayback machine Europython 2011
  • 12. Two years later ... in 1998 Europython 2011
  • 13. It’s super awesome and still it is Europython 2011
  • 14. Why we need intelligence in web apps? Europython 2011
  • 15. Information Overload Massive Data: Wicked Hard Source: Stefano Bussolon (flickr) Europython 2011
  • 16. Information Overload rate ~ 130 M websites ~ 150 M tweets per day ~ 2 M pics on twitter per day ~ 4 M blogpost at WordPress per day Europython 2011
  • 17. Why we need intelligence in web apps To find interesting information from data. To do something new in your web apps. To stay in competition in long run. Lots of opportunity. Europython 2011
  • 18. What is Machine Learning (ML)? Europython 2011
  • 19. What is Machine Learning (ML)? Algorithms that allows machines to learn Europython 2011
  • 20. Examples Spam detection (Classification) Google News (Clustering, Recommendation) Facebook friend recommendation (Recommendation/Collaborative filtering) Decide price for your products (Nearest Neighbor) and so on ... Europython 2011
  • 21. Machine Learning Supervised Unsupervised Classification Clustering ANN, Self regression organized etc maps (SOM) Europython 2011
  • 22. ML modules in Python Over 20 python modules (MDP, PyBrain, SciKit.learn, NLTK, Orange, LibSVM, Elefant etc) But ... Europython 2011
  • 23. None of them covers all. MDP is pure python. Scikit.learn looks more promising for real word problems. Europython 2011
  • 24. Hardest part is to extract features that these modules can read it. Europython 2011
  • 25. Clustering Source: NASA (NGC 869) Europython 2011
  • 27. Pick one: k-means We have N items and we need to cluster them into K groups. Europython 2011
  • 28. Randomly select k items as centroids Europython 2011
  • 29. Compute distance between centroid and items and assign them to cluster Europython 2011
  • 30. Adjust the centroids Europython 2011
  • 31. Re-run until it converge Europython 2011
  • 32. Example: News aggregator web app Example: Write a simple web app in Django which aggregate news articles from different sources. Europython 2011
  • 33. Example: News aggregator web app # models.py class NewsSource(models.Model): name = models.CharField(max_length=50) url = models.URLField() class News(models.Model) title = models.CharField(max_length=100) content = models.TextField() source = models.ForeignKey(“NewsSource”) cluster = models.ForeignKey(“Cluster”, null=True, blank=True) ... class Cluster(models.Model) created = models.DateTimeField(auto_save_add=True) ... Europython 2011
  • 34. And then an awesome UI - It’s IMPORTANT Europython 2011
  • 35. Let’s make it super awesome by clustering news articles into related stories. Europython 2011
  • 36. Step 1: Finding distance between stories Europython 2011
  • 37. Story 1: Germany eases stance, boosting hope for Greek aid Story 2: Private sector needed in Greek aid deal, Germany says Looks similar, they should be in same cluster Europython 2011
  • 38. Distance(A, B) = 1 - Similarity(A, B) or Distance(A, B) = 1 - n(A ∩ B)/n(A U B) Also known as Jaccard Coefficient, you may also use minHash (it’s quick when you have high dimensionality) Europython 2011
  • 39. Story 1: Germany eases stance, boosting hope for Greek aid Story 2: Private sector needed in Greek aid deal, Germany says Distance(story1, story2) = 1 - 3/12 (ignore stop words)= 0.75 Europython 2011
  • 40. Python code:

        import re
        from django.utils.stopwords import strip_stopwords

        def jaccard_distance(item1, item2):
            """A simple distance function (curse of dimensionality applies)."""
            # Tokenize each story into a bag of words (first 100 tokens only)
            feature1 = set(re.findall(r'\w+', strip_stopwords("%s %s" % (item1.title.lower(), item1.content.lower())))[:100])
            feature2 = set(re.findall(r'\w+', strip_stopwords("%s %s" % (item2.title.lower(), item2.content.lower())))[:100])
            similarity = 1.0 * len(feature1.intersection(feature2)) / len(feature1.union(feature2))
            return 1 - similarity
  • 41. Brainstorming: How many clusters are required? What kind of clustering is required?
  • 42. Modify k-means. Not every cluster has to contain items; we need quality (a threshold). This allows us to pick an arbitrary k and discard empty clusters. It also allows us to maintain quality within clusters. Actually, it’s closer to Canopy Clustering.
  • 43. Python code:

        import random

        class Cluster(object):
            """Clustering class."""

            def __init__(self, items, distance_function=jaccard_distance):
                self.distance = distance_function
                self.items = items

            def kmeans(self, k=10, threshold=0.80):
                """k is the number of clusters; threshold is the maximum acceptable distance."""
                # Pick k random stories and make them centroids
                centroids = random.sample(self.items, k)
                # Remove the centroids from the collection
                items = list(set(self.items) - set(centroids))
                ...
  • 44. Python code:

                last_matches = None
                # Max. 50 iterations for convergence
                for t in range(50):
                    # Make k empty clusters
                    best_matches = [[] for c in centroids]
                    # Find which centroid is the closest to each item
                    for item in items:
                        best_center = 0
                        min_distance = 1.0  # maximum possible distance
                        for index, centroid in enumerate(centroids):
                            distance = self.distance(item, centroid)
                            if distance <= min_distance:
                                best_center = index
                                min_distance = distance
                        # Maintain cluster quality: skip items farther than the threshold
                        if min_distance <= threshold:
                            best_matches[best_center].append(item)
                    # If the results are the same as last time, this is complete
                    if best_matches == last_matches:
                        break
                    last_matches = best_matches
                    # Move best_matches to new centroids...
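For readers who want to run the thresholded variant end to end, here is a self-contained sketch on plain word sets. It is not the code from the talk: the slides do not show how the centroids are moved, so this version assumes a medoid update (the member with the smallest total distance to the rest of its cluster), and it leaves the centroid items in the pool rather than removing them. The names `jaccard`, `medoid`, `threshold_kmeans`, and the sample `stories` are all hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard distance between two sets of words."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def medoid(members, distance):
    """Member with the smallest total distance to all other members."""
    return min(members, key=lambda m: sum(distance(m, o) for o in members))


def threshold_kmeans(items, k, threshold, distance=jaccard, max_iter=50, seed=0):
    """Cluster items; anything farther than `threshold` from every centroid is left out."""
    rng = random.Random(seed)
    centroids = rng.sample(items, k)
    last = None
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for item in items:
            dists = [distance(item, c) for c in centroids]
            best = min(range(len(centroids)), key=dists.__getitem__)
            if dists[best] <= threshold:  # quality gate
                clusters[best].append(item)
        if clusters == last:  # nothing changed: converged
            break
        last = clusters
        # Assumed update rule: move each centroid to the medoid of its cluster.
        centroids = [
            medoid(members, distance) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return [c for c in clusters if c]  # discard empty clusters


stories = [
    frozenset("germany greek aid".split()),
    frozenset("greek aid deal germany".split()),
    frozenset("football final goal".split()),
    frozenset("football goal win".split()),
]
for cluster in threshold_kmeans(stories, k=2, threshold=0.8):
    print([sorted(story) for story in cluster])
```

Because the start is random, a run with k=2 may either find both topics or leave one topic unclustered; that is the trade-off the threshold introduces.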
  • 45. Sweet!
  • 46. More examples: handwriting recognition; image segmentation; marker clustering on maps; common patterns in your sales.
  • 47. Classification. Source: Danny Nicholson (flickr)
  • 48. Classification. We present some observations and their outcomes to a classifier. Then we present an unseen observation and ask the classifier to predict its outcome. Essentially, we want to classify a new record based on probabilities estimated from the training data.
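  • 49. Naïve Bayes Classifier. Priors: P(green) = 40/60 = 2/3, P(red) = 20/60 = 1/3. Likelihood of the white point's vicinity: P(vicinity | green) = 1/40, P(vicinity | red) = 3/20. Posterior scores (prior × likelihood): green = 2/3 × 1/40 = 1/60, red = 1/3 × 3/20 = 3/60. The red score is higher, so the white point is classified as red.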
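The numbers on this slide can be reproduced with exact fractions. The sketch below assumes the usual reading of the example: 60 labelled points (40 green, 20 red), of which 1 green and 3 red fall in the vicinity of the new white point. The variable names are illustrative.

```python
from fractions import Fraction

total = 60
counts = {"green": 40, "red": 20}      # labelled points per class
in_vicinity = {"green": 1, "red": 3}   # neighbours of the new (white) point

# Prior: how common each class is overall.
priors = {c: Fraction(n, total) for c, n in counts.items()}
# Likelihood: share of each class that lies in the vicinity.
likelihoods = {c: Fraction(in_vicinity[c], counts[c]) for c in counts}
# Unnormalised posterior score: prior x likelihood.
scores = {c: priors[c] * likelihoods[c] for c in counts}
# Normalise so the posteriors sum to 1.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

prediction = max(scores, key=scores.get)
print(scores, posteriors, prediction)
```

The scores come out as 1/60 for green and 1/20 (that is, 3/60) for red, so after normalising the white point is red with probability 3/4.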
  • 50. Example: Tweet sentiment. Write a simple web app in Django that tells the mood of a tweet (sentiment analysis).
  • 51. Example: Tweet sentiment

        # models.py
        from django.db import models

        CHOICES = (
            ('H', 'Happy'),
            ('S', 'Sad'),
            ('N', 'Neutral'),
        )

        class Tweet(models.Model):
            author = models.CharField(max_length=50)
            tweet = models.TextField(max_length=140)
            sentiment = models.CharField(max_length=1, default='N', choices=CHOICES)
            url = models.URLField()
            ...
  • 52. Example: Tweet sentiment

        # classifier.py
        from classifier import BayesClassifier

        happy = ["happy", "awesome", "amazing", "impressed", ":)"]
        sad = ["sad", "sucks", "problem", "headache", ":("]

        classifier = BayesClassifier()
        classifier.train(happy, "happy")
        classifier.train(sad, "sad")

        classifier.guess("Europython is awesome this year")  # happy
        classifier.guess("It's sad I can't make it to europython this year")  # sad
        ...
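The slides do not show what is inside `BayesClassifier`, so the following is only a guess at how a classifier with the same `train`/`guess` shape could work. It is a multinomial Naïve Bayes with add-one smoothing; the class name `TinyBayesClassifier`, the tokenizer, and the smoothing choice are all assumptions made for this sketch.

```python
import math
import re
from collections import Counter, defaultdict


class TinyBayesClassifier:
    """Hypothetical stand-in for the talk's BayesClassifier (not the original)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        # Words (with apostrophes) plus simple emoticons such as :) and :(
        return re.findall(r"[\w']+|[:;][()]", text.lower())

    def train(self, examples, label):
        for text in examples:
            self.doc_counts[label] += 1
            for tok in self.tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocabulary.add(tok)

    def guess(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # log prior + sum of log likelihoods, with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in self.tokenize(text):
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = TinyBayesClassifier()
classifier.train(["happy", "awesome", "amazing", "impressed", ":)"], "happy")
classifier.train(["sad", "sucks", "problem", "headache", ":("], "sad")
print(classifier.guess("Europython is awesome this year"))
print(classifier.guess("It's sad I can't make it to europython this year"))
```

With ten training words the model only reacts to exact matches; a text containing none of them ends in a tie, and this sketch then returns whichever label was trained first.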
  • 53. Happy: "Pack done, tomorrow fly Frankfurt Hahn - Pisa . Ready for a great week to the @europython"
  • 54. Sad: "Bag too small for running shoes. No running for me at #europython I guess"
  • 55. Enhancements: Tweet sentiment. Bag of words; human computing.
  • 56. Recommendation (Collaborative Filtering)
  • 57. User based or Item based?
  • 58. User-based recommendation. Similar users: relates a user to other users (Pearson correlation). Easy: quite easy to implement. Slow: it’s slow for large datasets. Makes sense: for social sites.
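A user-based recommender built on Pearson correlation can be sketched as follows. The ratings, user names, and the functions `pearson` and `recommend_for` are invented for this illustration; the weighting scheme (a similarity-weighted average over positively correlated users) is one common choice, not necessarily the one the speaker had in mind.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "alice": {"Django": 5, "Pylons": 3, "Zope": 1},
    "bob": {"Django": 4, "Pylons": 3, "Zope": 2, "CherryPy": 5},
    "carol": {"Django": 1, "Pylons": 3, "Zope": 5, "TurboGears": 4},
}


def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    shared = sorted(set(a) & set(b))
    n = len(shared)
    if n < 2:
        return 0.0
    xs = [a[i] for i in shared]
    ys = [b[i] for i in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def recommend_for(user, ratings, top_n=3):
    """Rank unseen items by the similarity-weighted ratings of like-minded users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim <= 0:  # ignore users with opposite or unrelated taste
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:top_n]]


print(recommend_for("alice", ratings))
```

Every request compares the user with all other users, which is why this approach is easy to write but slows down as the user base grows.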
  • 59. Item-based recommendation. Items for a user: recommends items to a user. Faster: it’s faster when the dataset is huge (Amazon, Netflix, etc.). Makes sense: for recommending products.
  • 60. Why is item-based CF fast? Users change very often, but items do not.
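Because items change rarely, the item-to-item similarities can be computed once and reused for every request. The sketch below illustrates that split under assumptions of my own: cosine similarity over rating vectors, made-up ratings, and the hypothetical functions `transpose`, `cosine`, `build_item_similarity`, and `recommend_items`.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"Django": 5, "Pylons": 4},
    "bob": {"Django": 5, "Pylons": 5, "Zope": 1},
    "carol": {"Django": 1, "Zope": 5, "TurboGears": 5},
    "dave": {"Django": 4, "Pylons": 4},
}


def transpose(ratings):
    """Turn user -> item -> rating into item -> user -> rating."""
    by_item = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            by_item[item][user] = rating
    return by_item


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    dot = sum(a[u] * b[u] for u in a if u in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_item_similarity(ratings):
    """Slow step, run offline: similarity between every pair of items."""
    by_item = transpose(ratings)
    return {
        item: {other: cosine(vec, by_item[other]) for other in by_item if other != item}
        for item, vec in by_item.items()
    }


def recommend_items(user_ratings, similarity, top_n=3):
    """Fast step, run per request: score unseen items via the precomputed table."""
    scores = defaultdict(float)
    for item, rating in user_ratings.items():
        for other, sim in similarity.get(item, {}).items():
            if other not in user_ratings and sim > 0:
                scores[other] += sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


similarity = build_item_similarity(ratings)
print(recommend_items({"Zope": 5}, similarity))
```

Only `recommend_items` runs when a user visits; the similarity table can be rebuilt on a schedule, since new users do not invalidate it.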
  • 61. Examples: Last.fm; Facebook and Twitter friend recommendation; Google Ads, Facebook Ads, etc.
  • 62. Scared? Options: scikit-learn (Python); Google Prediction API (RESTful); Apache Mahout (Java / JPype); Amazon Mechanical Turk (human computing).
  • 63. Limitations. Accuracy: oh la la, not always accurate (locality). Computation: a large dataset might require lots of computation. Wicked hard problems: in some cases it’s just wickedly hard. And many more...
  • 64. Awesome web app checklist: an awesome idea; an awesome architecture (scalable); an awesome web framework (probably Django); an awesome UI; an awesome cloud solution (if required).
  • 65. Super awesome checklist: an awesome idea; an awesome architecture (scalable); an awesome web framework (probably Django); an awesome UI; do interesting things with your data; an awesome cloud solution (if required).
  • 66. This book is awesome
  • 67. Other useful techniques: Support Vector Machines (classification); Canopy Clustering; locality-sensitive hashing; k-dimensional trees.
  • 68. You’re Awesome - Q&A https://github.com/iapain/smartwebapps
