flux of meme - final report
            telecom italia, milan 30.9.11
            thomas alisi
            @grudelsud




Friday, September 30, 11
the basics




the idea

                  Meme: a postulated unit or element of cultural ideas transmitted from one mind to
                  another through speech or similar phenomena.


                  Zeitgeist: German language expression referring to "the spirit of the times"


                  Semantic Web: an evolving development of the World Wide Web in which the
                  meaning (semantics) of information on the web is defined, making it possible for
                  machines to process it


                  Flux of MEME: analysis of the web Zeitgeist through geo-localized Memes, updated
                  and shared on social media mainly via mobile networks


background

                  yahoo research
                           WWW2011 - Who Says What to Whom on Twitter - Wu, Hofman, Mason, Watts
                           WSDM2011 - Who Uses Web Search for What? And How? - Weber, Jaimes
                           CSCW2011 - Peaks and Persistence: Modeling the Shape of Microblog
                           Conversations - Shamma, Kennedy, Churchill


                  others
                           WWW2010 - What is Twitter, a Social Network or a News Media? - Kwak, Lee,
                           Park, Moon
                           Tech report 2009 (Princeton / Carnegie Mellon) - Topic Models - Blei, Lafferty
                           Tech report 2009 (Facebook / Maryland / Princeton) - Reading Tea Leaves: How
                           Humans Interpret Topic Models - Chang, Boyd-Graber, Gerrish, Wang, Blei

algorithm steps




                           1. fetch data   2. create clusters   3. extract topics   4. analyze stats




implementation




step 1. fetch data!


                  using the free Spritzer access to
                  Twitter streaming API (~1% of total
                  tweets)
                  defined set of location boxes (Italy, UK,
                  France, Spain)
                  reinforcing locations with geonames
                  didn’t prove to be efficient (origin: from
                  a galaxy far far away)
                  enrich content through web scraping,
                  also carrying meta & opengraph
                  keywords
                  blacklist of noisy sources
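
A minimal sketch of the location-box and blacklist filtering listed above, in Python. The box coordinates, the `BLACKLIST` entry and the post fields (`lat`, `lon`, `source`) are illustrative assumptions, not values taken from the project.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """A location box given by its south-west and north-east corners."""
    name: str
    sw_lat: float
    sw_lon: float
    ne_lat: float
    ne_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.sw_lat <= lat <= self.ne_lat
                and self.sw_lon <= lon <= self.ne_lon)


# Rough boxes for illustration only (assumed values); France and Spain
# would be added the same way.
BOXES = [
    Box("Italy", 36.6, 6.6, 47.1, 18.5),
    Box("UK", 49.9, -8.2, 60.9, 1.8),
]

# Hypothetical blacklist of noisy sources.
BLACKLIST = {"spam-bot.example"}


def accept(post: dict) -> Optional[str]:
    """Return the name of the first box containing the post, or None."""
    if post.get("source") in BLACKLIST:
        return None
    if post.get("lat") is None or post.get("lon") is None:
        return None  # not geotagged
    for box in BOXES:
        if box.contains(post["lat"], post["lon"]):
            return box.name
    return None
```

Posts that are not geotagged are simply dropped here; the slides mention geonames as an attempted fallback, which this sketch does not model.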


step 2. create geo-clusters




                  create time slices
                  select all the posts within a time slice
                  choose geo-granularity (radius of clusters)
                  agglomerate posts with Hierarchical
                  Agglomerative Clustering (HAC)
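
The steps above can be sketched as follows, in plain Python. This is only an illustration under stated assumptions: the slides do not say which linkage criterion or distance the project uses, so single linkage over great-circle (haversine) distance is assumed, and the daily slice and the `time`, `max_km` names are hypothetical.

```python
import math
from collections import defaultdict


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def time_slices(posts):
    """Group posts by calendar day (assumed slice length)."""
    slices = defaultdict(list)
    for post in posts:
        slices[post["time"].date()].append(post)
    return dict(slices)


def hac(points, max_km):
    """Single-linkage agglomerative clustering, stopped at max_km."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters (O(n^3) overall, fine for a sketch).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(haversine_km(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_km:
            break  # geo-granularity reached: no pair is close enough
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Here `max_km` plays the role of the geo-granularity: a larger value yields fewer, wider clusters.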




step 3. extract topics
                  each geo-cluster is treated as a single document: the bag of words of
                  all its posts
                  topic extraction is implemented with LDA
                           α: Dirichlet prior parameter on the per-document topic
                           distributions (frontend output: weight)
                           β: Dirichlet prior parameter on the per-topic word distributions
                           θ_i: the topic distribution for document i
                           z_ij: the topic for the j-th word in document i
                           w_ij: the specific word
                  user defined params:
                           number of topics,
                           number of words per topic,
                           min followers
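
For reference, the LDA symbols listed above fit together in the standard generative process shown below. The per-topic word distribution is written here as φ_k, a symbol that does not appear on the slide and is introduced only for this sketch.

```latex
\begin{align*}
\theta_i  &\sim \mathrm{Dirichlet}(\alpha)         && \text{topic distribution of document } i \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta)          && \text{word distribution of topic } k \\
z_{ij}    &\sim \mathrm{Multinomial}(\theta_i)     && \text{topic of the } j\text{-th word in document } i \\
w_{ij}    &\sim \mathrm{Multinomial}(\varphi_{z_{ij}}) && \text{the observed word}
\end{align*}
```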

step 4. analyze data




                  define search context: topics or keywords
                  perform live search with TF-IDF indicators
                  display time-lapse of clusters’ analytics
                  evolution (log-scale count and average size)
                  quick and easy interface: toggle visibility of
                  clusters
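
As an illustration of the TF-IDF indicators mentioned above, here is a minimal sketch in plain Python. The slides do not state which TF-IDF variant is used, so the common form (relative term frequency times the log of the inverse document frequency) is assumed, with each geo-cluster playing the role of a document.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists, one per geo-cluster (assumed input shape).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores


docs = [
    ["sciopero", "treni", "milano"],
    ["milano", "derby"],
    ["milano", "sciopero"],
]
scores = tf_idf(docs)
# "milano" occurs in every document, so its score is 0 everywhere,
# while "derby" is specific to the second document.
```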




step 4. analyze data




                  drag and zoom on specific location boxes
                  select time interval
                  display aggregated stats of clusters (count
                  and size) within location box
                  show and export breakdown of posts’
                  languages




step 4. analyze data

                                   show stats and content of
                                   specific clusters
                                     lat-lon of centroids, std.
                                     deviation, surface and
                                     radius
                                   display weighted topics,
                                   TF-IDF of terms within
                                   topics, TF-IDF of meta
                                   keywords
                                   show / export list of posts
                                   show related links
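
The per-cluster geometry listed above (centroid, standard deviation, surface and radius) could be computed along these lines. This is a sketch under assumptions: the slides do not define these quantities, so the centroid is taken as the arithmetic mean of the coordinates (reasonable only for small clusters), the radius as the largest centroid-to-post distance, and the surface as the area of the circle with that radius.

```python
import math
from statistics import mean, pstdev


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cluster_stats(points):
    """Summarise a non-empty list of (lat, lon) post positions."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    centroid = (mean(lats), mean(lons))
    radius_km = max(haversine_km(centroid, p) for p in points)
    return {
        "centroid": centroid,
        "std_lat": pstdev(lats),   # population std. deviation, in degrees
        "std_lon": pstdev(lons),
        "radius_km": radius_km,
        "surface_km2": math.pi * radius_km ** 2,  # assumed circular footprint
    }
```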




step 4. analyze data




                                   show query metrics and
                                   parameters
                                   display overall TF-IDF for
                                   the selected query




demo
            http://fom.londondroids.com/fom/




sorry guys, now the boring stuff...
            backend, front-end API, cron jobs




Backend
                  Streaming API
                           a batch process is constantly
                           running and saving data on the
                           db
                           options: fetch by search query,
                           expand terms with wikiminer,
                           access all the stream, filter
                           geotagged, filter location box,
                           fetch related content
                  Clustering and Topic extraction
                           define geo granularity
                           time/size of geo clusters
                           followers and retweets
                           number of topics / keywords
                           language mapping

API




                  search clusters containing
                  specific topics / keywords
                  returns lists of clusters
                  ordered by topic weight
                  all the data extraction API
                  conforms to a RESTful
                  model and returns JSON
                  structured data




API




                  read list of geographic
                  clusters
                  usually called after a search
                  topic has been raised




API




                  read semantic content of a
                  geographic cluster
                  topics grouped by score (alpha
                  parameter in LDA), with words
                  weighted by TF-IDF with
                  respect to the whole cluster
                  content




API




                  read meta / opengraph
                  content of a geographic
                  cluster




API
                  export list of posts
                           exports all the posts contained in a cluster
                           example request: /cluster/export_posts/1026/csv
                  read post content
                           reads the content of a post
                           example request: /cluster/read_post/560951
                  read related link
                           reads the content of a link related to a post (the id is usually taken from the “links” variable returned by the function above)
                           example request: /cluster/read_link/16268
                  execute cluster stats within a location box
                           reads the list of clusters contained within a location box and creates stat charts (in the form of google chart images)
                           example request: /cluster/dzstat/c_since=2011-05-07/c_until=2011-05-10/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
                  execute post stats within a location box
                           reads the list of posts contained within a location box and performs stats on languages
                           example request: /search/dzstat/p_since=2011-05-07/p_until=2011-05-10/p_timespan=daily/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
                  read query content
                           reads the list of geo-clusters associated to a specific query id (usually fetched by the function above)
                           example request: /cluster/read/2
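
The two `dzstat` example requests above share the same path layout, which can be assembled with a small helper. This is a hypothetical client-side sketch in Python: the function name and its arguments are assumptions, and only the path segments are taken from the examples on the slide. No request is sent.

```python
def dzstat_path(resource, prefix, since, until, sw, ne, timespan=None):
    """Build a dzstat request path.

    resource: "cluster" or "search"; prefix: "c" or "p" (as in the examples).
    sw, ne: (lat, lon) of the south-west and north-east box corners.
    """
    parts = [f"{prefix}_since={since}", f"{prefix}_until={until}"]
    if timespan is not None:
        parts.append(f"{prefix}_timespan={timespan}")
    parts += [
        f"swLat={sw[0]}", f"swLon={sw[1]}",
        f"neLat={ne[0]}", f"neLon={ne[1]}",
    ]
    return f"/{resource}/dzstat/" + "/".join(parts)


cluster_path = dzstat_path(
    "cluster", "c", "2011-05-07", "2011-05-10", (44.61, 8.52), (45.57, 11.33)
)
post_path = dzstat_path(
    "search", "p", "2011-05-07", "2011-05-10", (44.61, 8.52), (45.57, 11.33),
    timespan="daily",
)
```

With these arguments the helper reproduces the two example request paths shown above.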


Cron




                  keep everything running
                           restart the streaming API
                           now and then, so as to
                           keep twitter happy
                           create the clusters at the
                           end of the day




servers




final thoughts




improvements

                  optimize time slicing!
                           emerging topics should be checked on an hourly basis across the complete dataset
                  train models!
                           a training set would be ideal to create models and optimize the performance of the topic
                           extraction algorithm
                           models could relate to specific contexts in order to improve results (e.g. all the tweets from
                           newspapers)
                  create language classifiers
                           increase the precision of language detection with naive bayes classifiers
                  think of scalability
                           increasing the amount of data makes it necessary to scale up to Map/Reduce architectures
                  increase flexibility (e.g. manage multimedia data, offer a rich contextualized API, ...)
                  enhance analysis and visualization (e.g. reinforce topic correlation / n-grams)

other refs

                  algorithms
                           LDA - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
                           HAC - http://en.wikipedia.org/wiki/Cluster_analysis
                  libraries
                           twitter 4 java - http://twitter4j.org
                           machine learning - http://mallet.cs.umass.edu/
                           jquery (core + ui) - http://jquery.org/
                           data tables - http://datatables.net/
                           chart api - http://code.google.com/apis/chart/
                  image courtesy
                           http://yesyesno.com/nike-city-runs

?
            thanks!
                  codebase source + wiki https://github.com/grudelsud/fom
                  thomas alisi
                  @grudelsud
                  giuseppe serra
                  @giuseppeserra
                  marco bertini
                  @bertinimarco





Flux of MEME - final report

  • 1. flux of meme - final report telecom italia, milan 30.9.11 thomas alisi @grudelsud Friday, September 30, 11
  • 3. the idea Meme: a postulated unit or element of cultural ideas transmitted from one mind to another through speech or similar phenomena. Zeitgeist: German language expression referring to "the spirit of the times" Semantic Web: an evolving development of the World Wide Web in which the meaning (semantics) of information on the web is defined, making it possible for machines to process it Flux of MEME: analysis of the web Zeitgeist through geo-localized Memes, updated and shared on social media mainly via mobile networks Friday, September 30, 11
  • 4. background yahoo research WWW2011 - Who Says What to Whom on Twitter - Wu, Hofman, Mason, Watts WSDM2011 - Who Uses Web Search for What? And How? - Weber, Jaimes CSCW2011 - Peaks and Persistence: Modeling the Shape of Microblog Conversations - Shamma, Kennedy, Churchill others WWW2010 - What is Twitter, a Social Network or a News Media? - Kwak, Lee, Park, Moon Tech report 2009 (Princeton / Carnegie Mellon) - Topic Models - Blei, Lafferty Tech report 2009 (Facebook / Maryland / Princeton) - Reading Tea Leaves: How Humans Interpret Topic Models - Chang, Boyd-Graber, Gerrish, Wang, Blei Friday, September 30, 11
  • 5. algorithm steps 1. fetch data 2. create clusters 3. extract topics 4. analyze stats Friday, September 30, 11
  • 7. step 1. fetch data! using the free Spritzer access to Twitter streaming API (~1% of total tweets) defined set of location boxes (Italy, UK, France, Spain) reinforcing locations with geonames didn’t prove to be efficient (origin: from a galaxy far far away) enrich content through web scraping, also carrying meta & opengraph keywords blacklist of noisy sources Friday, September 30, 11
  • 8. step 2. create geo-clusters create time slices select all the posts within a time slice choose geo-granularity (radius of clusters) agglomerate posts with Hierarchical Agglomerative Clustering (HAC) Friday, September 30, 11
  • 9. step 3. extract topics a geo-cluster represents the whole bag of word used to define a document topic extraction is implemented with LDA α Dirichlet prior param. on the per-document topic distributions (frontend output: weight) β Dirichlet prior param on the per-topic word distribution θi is the topic distribution for document i, zij is the topic for the jth word in document i, and wij is the specific word. user defined params: number of topics, number of words per topic, min followers Friday, September 30, 11
  • 10. step 4. analyze data define search context: topics or keywords perform live search with TF-IDF indicators display time-lapse of clusters’ analytics evolution (log-scale count and average size) quick and easy interface: toggle visibility of clusters Friday, September 30, 11
  • 11. step 4. analyze data: drag and zoom on specific location boxes; select time interval; display aggregated stats of clusters (count and size) within location box; show and export breakdown of posts’ languages
  • 12. step 4. analyze data: show stats and content of specific clusters; lat-lon of centroids, std. deviation, surface and radius; display weighted topics, TF-IDF of terms within topics, TF-IDF of meta keywords; show / export list of posts; show related links
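The per-cluster figures listed on this slide could be computed along these lines. The slides do not define them, so every definition here is an assumption: the centroid is the mean lat/lon, the radius is the largest distance from the centroid, the std. deviation is taken over those distances, and the surface is the area of a circle with that radius. `haversine_km` and `cluster_stats` are hypothetical names.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cluster_stats(points):
    """Summary figures for a non-empty list of (lat, lon) posts."""
    n = len(points)
    center = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    dists = [haversine_km(center, p) for p in points]
    mean = sum(dists) / n
    radius = max(dists)  # assumed: furthest post from the centroid
    return {
        "centroid": center,
        "radius_km": radius,
        # assumed: population std. deviation of distances from the centroid
        "std_km": math.sqrt(sum((d - mean) ** 2 for d in dists) / n),
        # assumed: area of the circle with the radius above
        "surface_km2": math.pi * radius ** 2,
    }
```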
  • 13. step 4. analyze data: show query metrics and parameters; display overall TF-IDF for the selected query
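The TF-IDF indicators used throughout step 4 can be illustrated with a small sketch. It assumes the textbook variant (relative term frequency times the log of inverse document frequency) and treats each geo-cluster as one document; the slides do not state which weighting scheme the project applies, and `tf_idf` is a hypothetical name.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists (one per geo-cluster).
    Returns one {term: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores
```

A term that occurs in every cluster scores zero, so only terms that set a cluster apart from the others carry weight.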
  • 14. demo: http://fom.londondroids.com/fom/
  • 15. sorry guys, now the boring stuff... backend, front-end API, cron jobs
  • 16. Backend. Streaming API: a batch process is constantly running and saving data on the db; options: fetch by search query, expand terms with wikiminer, access all the stream, filter geotagged, filter location box, fetch related content. Clustering and Topic extraction: define geo granularity, time/size of geo clusters, followers and retweets, number of topics / keywords, language mapping
  • 17. API: search clusters containing specific topics / keywords; returns lists of clusters ordered by topic weight; all the data extraction API conforms to a RESTful model and returns JSON structured data
  • 18. API: read list of geographic clusters; usually called after a search topic has been raised
  • 19. API: read semantic content of a geographic cluster; topics grouped by score (alpha parameter in LDA) and words weighted with TF-IDF with respect to the whole cluster content
  • 20. API: read meta / opengraph content of a geographic cluster
  • 21. API. export list of posts: exports all the posts contained in a cluster (example request: /cluster/export_posts/1026/csv). read post content: reads the content of a post (example request: /cluster/read_post/560951). read related link: reads the content of a link related to a post; the id is usually fetched through the variable “links” returned by the function above (example request: /cluster/read_link/16268). execute cluster stats within a location box: reads the list of clusters contained within a location box and creates stat charts in the form of google chart images (example request: /cluster/dzstat/c_since=2011-05-07/c_until=2011-05-10/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33). execute post stats within a location box: reads the list of posts contained within a location box and performs stats on languages (example request: /search/dzstat/p_since=2011-05-07/p_until=2011-05-10/p_timespan=daily/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33). read query content: reads the list of geo-clusters associated with a specific query id, usually fetched by the function above (example request: /cluster/read/2)
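The example requests above follow a path-segment convention (`key=value` pieces separated by slashes) that a client could assemble as sketched here. The base URL is the demo address from slide 14; that the API is served under the same prefix is an assumption, and the helper names are hypothetical. No request is sent.

```python
# Demo address from the slides; assumed (not confirmed) to also serve the API.
BASE_URL = "http://fom.londondroids.com/fom"


def export_posts_url(cluster_id, fmt="csv"):
    """Path for exporting all the posts contained in a cluster."""
    return f"{BASE_URL}/cluster/export_posts/{cluster_id}/{fmt}"


def read_post_url(post_id):
    """Path for reading the content of a single post."""
    return f"{BASE_URL}/cluster/read_post/{post_id}"


def cluster_dzstat_url(since, until, sw_lat, sw_lon, ne_lat, ne_lon):
    """Path for cluster stats within a location box and date range."""
    params = {
        "c_since": since,
        "c_until": until,
        "swLat": sw_lat,
        "swLon": sw_lon,
        "neLat": ne_lat,
        "neLon": ne_lon,
    }
    # Parameters travel as key=value path segments, in the order shown
    # in the example request on the slide.
    segments = "/".join(f"{key}={value}" for key, value in params.items())
    return f"{BASE_URL}/cluster/dzstat/{segments}"
```

Since the API is described as returning JSON structured data, the resulting URL could then be fetched with any HTTP client and the response decoded as JSON.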
  • 22. Cron: keep everything running; restart the streaming API now and then, so as to keep twitter happy; create the clusters at the end of the day
  • 26. improvements. optimize time slicing! emerging topics should be checked on an hourly basis across the complete dataset. train models! a training set would be ideal to create models and optimize the performance of the topic extraction algorithm; models could relate to specific contexts in order to improve results (e.g. all the tweets from newspapers). create language classifiers: increase the precision of language detection with naive bayes classifiers. think of scalability: increasing the amount of data makes it necessary to scale up to Map/Reduce architectures. increase flexibility (e.g. manage multimedia data, offer a rich contextualized API, ...). enhance analysis and visualization (e.g. reinforce topic correlation / n-grams)
  • 27. other refs. algorithms: LDA (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation), HAC (http://en.wikipedia.org/wiki/Cluster_analysis); libraries: twitter 4 java (http://twitter4j.org), machine learning (http://mallet.cs.umass.edu/), jquery core + ui (http://jquery.org/), data tables (http://datatables.net/), chart api (http://code.google.com/apis/chart/); image courtesy: http://yesyesno.com/nike-city-runs
  • 28. ? thanks! codebase source + wiki: https://github.com/grudelsud/fom - thomas alisi @grudelsud, giuseppe serra @giuseppeserra, marco bertini @bertinimarco