musweet.com
Handling Humongous Data Sets
from the Social Web

Grischa Andreew & Nader Cserny, compuccino
Agenda


• Introduction

• Technology

• Data in Detail

• Queries

• Tools & Debugging

• Questions
Introduction
What is musweet?
Media Stream
Topics
Analytics
Profile
Statistics




7,002       ARTISTS
20,848      SOCIAL MEDIA PROFILES
3,914,259   STREAM ITEMS
309,855     MEDIA ITEMS (AUDIO, VIDEO, PHOTO)
~3,450      READ QUERIES / SEC
~15,000     INSERTS / DAY
3.75 GB     DATA SIZE
Technology
What do we need?

ARTIST              Name, city, genre, image

SOCIAL PROFILE      Platform (e.g. Twitter), link

MEDIA POSTS         Images, videos, audio, status updates

PROFILE INFO        Friends, followers, date, websites,
                    profile image, biography, label, etc.
MySQL Schema

Artist
  Fields:  id, name
  Indexes: name

Numbers
  Fields:  artist, socialprofile, outgoing, incoming, feedback, push
  Indexes: artist_outgoing, artist_incoming, artist_feedback, artist_push, artist

Socialprofile
  Fields:  id, artist, url, service
  Indexes: artists_service

artist_genres
  Fields:  artist, genre
  Indexes: artist_genre

Genres
  Fields:  id, name
  Indexes: name

Stream
  Fields:  artist, socialprofile, message, created_at
  Indexes: message, artist_created_at, created_at

Service Information: Twitter
  Fields:  artist, lang, verified, location, id, url, created_at, description,
           time_zone, profile_image_url, screen_name
  Indexes: artist

Service Information: Facebook
  Fields:  artist, category, name, fan_count, bio, url, username, record_label,
           location, profile_image_url, band_members, website, link,
           pinnwand_posts, genre, friends, id
  Indexes: artist

Service Information: Myspace
  Fields:  artist, website, genre, location, art_des_labels, headline,
           created_at, id, profile_image_url, label
  Indexes: artist

Stream Information: Facebook
  Fields:  stream, name, caption, link, likes, type, icon
  Indexes: stream

Stream Information: Twitter
  Fields:  stream, source, in_reply_to_status_id, in_reply_to_user_id,
           truncated, deleted
  Indexes: stream

Stream Information: Myspace
  Fields:  stream, category, image, link, source
  Indexes: stream
MongoDB Schema

Artist
  id (ObjectId)
  name (str)
  genres (strict array)
  socialprofiles (strict array)
    service (dbref)
    url (str)
    numbers (strict array)
      incoming
      outgoing
      push
      feedback
      date
    meta (array)
      (different fields, depending on the platform)
  Indexes
    name,
    genres,
    socialprofiles.service,
    socialprofiles.numbers

Stream
  id (hash of the Facebook / MySpace / Twitter id)
  socialprofile (dbref)
  genres (strict array)
    (duplicated from the artist so the stream can be queried
    directly by genre)
  data (array)
    (data from the platforms; the message field is required)
  created_at (datetime)
  Indexes
    socialprofile,
    genres,
    data.message
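
To make the document layout above more concrete, here is a minimal sketch of an artist and a stream item as plain JavaScript objects. The values are taken from the Twitter profile example in this deck. The `streamId` helper, the choice of SHA-1, the `service:id` key format, and the item id `12345` are assumptions for illustration; the slides only state that the stream id is a hash of the platform id.

```javascript
const crypto = require("crypto");

// Hypothetical helper: derive a stable stream id from the platform name and
// the platform's own item id. SHA-1 and the "service:id" format are assumptions.
function streamId(service, platformItemId) {
  return crypto
    .createHash("sha1")
    .update(`${service}:${platformItemId}`)
    .digest("hex");
}

// An artist document with one embedded social profile.
const artist = {
  name: "Snoop Dogg",
  genres: ["hiphop"],
  socialprofiles: [
    {
      service: "twitter", // a DBRef in the real schema; a plain string here
      url: "http://twitter.com/snoopdogg",
      numbers: { incoming: 2030350, outgoing: 1204, push: 3145, feedback: 22750 },
      meta: { screen_name: "SnoopDogg", verified: true },
    },
  ],
};

// A stream item. The genres are copied from the artist so the stream can be
// filtered by genre without a second lookup.
const streamItem = {
  _id: streamId("twitter", "12345"), // "12345" is a made-up item id
  socialprofile: artist.socialprofiles[0].url, // stands in for the DBRef
  genres: artist.genres.slice(),
  data: { message: "example status message" },
  created_at: new Date("2010-09-29T00:00:00Z"),
};

console.log(streamItem._id.length); // SHA-1 hex digests are 40 characters long
```

Hashing the platform id gives the same `_id` every time the crawler sees the same item, which is what makes re-importing an item idempotent.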
How do we get the data? (Simplified)



 Crawler                                 musweet

 • Processes links            Links      • Displays the content
 • Extracts media                        • Maps content to artist / service
 • Prepares the content

                              Data
Data in Detail
Artist profile on MySpace

"numbers" : {
     "outgoing" : 221665,
     "incoming" : 770355,
     "feedback" : 36862603,
     "push" : 0,
     "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)"
},
"meta" : {
     "website" : "http://www.snoopdogg.com",
     "genre" : "Hip Hop / Rap / R&B",
     "location" : "Long Beach, California Vereinigte Staaten von Amerika",
     "art_des_labels" : "Major",
     "headline" : "",
     "created_at" : "Sat Dec 11 2004 01:00:00 GMT+0100 (CET)",
     "id" : 6344278,
     "profile_image_url" : "http://c1.ac-images.myspacecdn.com/images02/130/m_9857dcca155247b69e1260e6e34cce3c.jpg",
     "label" : "Doggystyle / Priority"
}
Artist profile on Twitter

"numbers" : {
     "outgoing" : 1204,
     "incoming" : 2030350,
     "feedback" : 22750,
     "push" : 3145,
     "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)"
},
"meta" : {
     "lang" : "en",
     "verified" : true,
     "location" : "LBC",
     "id" : 3004231,
     "url" : "http://www.snoopdogg.com",
     "created_at" : "Fri Mar 30 2007 21:05:42 GMT+0200 (CEST)",
     "description" : "More Malice CD + DVD IN STORES NOW",
     "time_zone" : "Pacific Time (US & Canada)",
     "profile_image_url" : "http://a3.twimg.com/profile_images/1096549203/snoop_normal.jpg",
     "screen_name" : "SnoopDogg"
}
Artist profile on Facebook

"numbers" : {
     "outgoing" : 0,
     "incoming" : 2930860,
     "feedback" : 0,
     "push" : 0
},
"meta" : {
     "category" : "Musicians",
     "name" : "Snoop Dogg",
     "fan_count" : 2930860,
     "bio" : "The offices at the top of the Capitol Records building in Hollywood are home to some of Southern California’s most awe-inspiring views. ....",
     "url" : "http://www.facebook.com/snoopdogg?v=info",
     "username" : "snoopdogg",
     "record_label" : "Priority/Doggystyle ",
     "location" : "Long Beach, CA",
     "profile_image_url" : "http://profile.ak.fbcdn.net/hprofile-ak-snc4/hs622.snc3/27524_11455644806_1192_s.jpg",
     "band_members" : "Snoop Dogg",
     "website" : [
          "http://www.snoopdogg.com",
          "http://www.myspace.com/snoopdogg",
          "http://twitter.com/snoopdogg"
     ],
     "link" : "http://www.facebook.com/snoopdogg",
     "pinnwand_posts" : 0,
     "genre" : "Hip Hop / Rap / R&B",
     "friends" : 0,
     "id" : "11455644806"
}
Queries
MySQL vs. MongoDB (1)

All social media profiles of one artist, with follower counts

MySQL
SELECT
    n.incoming,
    a.id AS artist,
    a.name AS artist_name,
    s.id AS socialprofile,
    s.url AS socialprofile_url
FROM
    numbers AS n
    JOIN socialprofile AS s ON s.id = n.socialprofile
    JOIN artist AS a ON a.id = n.artist
WHERE
    a.name = 'Snoop Dogg'
ORDER BY n.incoming DESC

Duration: 0.0288 sec.

MongoDB
db.artist.find( { "name": "Snoop Dogg" } )

Duration: 0.0001 sec.
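
The MongoDB query returns the whole artist document, so the per-profile follower counts come from the embedded `socialprofiles` array instead of a join. A minimal sketch of that client-side step, using an in-memory stand-in for the returned document (follower counts and URLs are taken from the profile examples in this deck; the `profilesByFollowers` helper and the plain-string `service` field are assumptions):

```javascript
// In-memory stand-in for the document returned by
// db.artist.find({ "name": "Snoop Dogg" }).
const artistDoc = {
  name: "Snoop Dogg",
  socialprofiles: [
    { service: "myspace", url: "http://www.myspace.com/snoopdogg", numbers: { incoming: 770355 } },
    { service: "twitter", url: "http://twitter.com/snoopdogg", numbers: { incoming: 2030350 } },
    { service: "facebook", url: "http://www.facebook.com/snoopdogg", numbers: { incoming: 2930860 } },
  ],
};

// Hypothetical helper: list the profiles with their follower counts, highest
// first, mirroring the ORDER BY n.incoming DESC of the SQL query.
function profilesByFollowers(doc) {
  return doc.socialprofiles
    .map((p) => ({ service: p.service, url: p.url, incoming: p.numbers.incoming }))
    .sort((a, b) => b.incoming - a.incoming);
}

const rows = profilesByFollowers(artistDoc);
console.log(rows.map((r) => `${r.service}: ${r.incoming}`).join("\n"));
```

Because everything needed sits in one document, the database does a single indexed lookup on `name`, which is consistent with the much shorter duration reported for MongoDB.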
MySQL vs. MongoDB (2)

The 10 hip-hop artists with the most followers

MySQL
SELECT
    n.incoming,
    a.id AS artist,
    a.name AS artist_name,
    s.id AS socialprofile,
    s.url AS socialprofile_url
FROM
    numbers AS n
    JOIN artist_genres AS ag ON ag.artist = n.artist
    JOIN genres AS g ON g.id = ag.genre
    JOIN socialprofile AS s ON s.id = n.socialprofile
    JOIN artist AS a ON a.id = n.artist
WHERE
    g.name = 'Hip/Hop'
ORDER BY
    n.incoming DESC
LIMIT 10

Duration: 0.8741 sec.

MongoDB
db.artist.find( {
  "genre": DBRef("genre","hiphop")
} ).sort( {
  "socialprofiles.numbers.incoming": -1
} ).limit(10)

Duration: 0.0230 sec.
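
What the filter, sort, and limit chain computes can be sketched in plain JavaScript over an in-memory array. The artist names and counts below are made up. The sketch assumes that a descending sort on an array-valued path compares each document by its largest value, and it uses the `genres` array from the schema with plain strings instead of DBRefs.

```javascript
// Hypothetical in-memory collection; names and counts are made up.
const artists = [
  { name: "Artist A", genres: ["hiphop"], socialprofiles: [{ numbers: { incoming: 500 } }, { numbers: { incoming: 9000 } }] },
  { name: "Artist B", genres: ["rock"], socialprofiles: [{ numbers: { incoming: 20000 } }] },
  { name: "Artist C", genres: ["hiphop", "pop"], socialprofiles: [{ numbers: { incoming: 7000 } }] },
  { name: "Artist D", genres: ["hiphop"], socialprofiles: [] },
];

// Largest follower count across an artist's profiles (0 if there are none).
function maxIncoming(artist) {
  return Math.max(0, ...artist.socialprofiles.map((p) => p.numbers.incoming));
}

// Filter by genre, sort by the largest follower count (descending), then limit.
function topByGenre(collection, genre, limit) {
  return collection
    .filter((a) => a.genres.includes(genre))
    .sort((a, b) => maxIncoming(b) - maxIncoming(a))
    .slice(0, limit);
}

const top = topByGenre(artists, "hiphop", 2).map((a) => a.name);
console.log(top);
```

With an index on `genres` and on the numbers path, the database can do this work without the four joins the relational schema requires.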
MySQL Index


• The index is read from left to right

  Field order matters

  Fields: "artist", "incoming", "push", "date"

  SELECT * FROM numbers WHERE artist = 1                                  Works
  SELECT * FROM numbers WHERE incoming = 1                                Does not work
  SELECT * FROM numbers WHERE artist = 1 AND push < 10                    Does not work for push (only the artist prefix is used)
  SELECT * FROM numbers WHERE artist = 1 AND push < 10 AND incoming > 0   Works


• Index debugging

  EXPLAIN SELECT * FROM numbers WHERE artist = 1
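
The left-to-right rule can be sketched as a small function that reports which leading index fields a query can use. This is a simplified model under stated assumptions (equality predicates extend the usable prefix, a range predicate ends it, a missing field stops it); the real optimizer has further strategies, so `EXPLAIN` remains the authority. The `usableIndexPrefix` name and the predicate encoding are hypothetical.

```javascript
// Simplified model of the leftmost-prefix rule for a composite index.
// predicates maps a field name to "eq" (equality) or "range" (<, >, ...).
function usableIndexPrefix(indexFields, predicates) {
  const used = [];
  for (const field of indexFields) {
    const kind = predicates[field];
    if (kind === undefined) break; // gap: later index fields cannot narrow the seek
    used.push(field);
    if (kind === "range") break; // a range predicate ends the usable prefix
  }
  return used;
}

const numbersIndex = ["artist", "incoming", "push", "date"];

// The four queries from the slide, in order.
const results = [
  usableIndexPrefix(numbersIndex, { artist: "eq" }),
  usableIndexPrefix(numbersIndex, { incoming: "eq" }),
  usableIndexPrefix(numbersIndex, { artist: "eq", push: "range" }),
  usableIndexPrefix(numbersIndex, { artist: "eq", push: "range", incoming: "range" }),
];
console.log(results);
```

An empty result means the index cannot be used to seek at all; a partial result means the remaining conditions are checked row by row after the seek.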
MongoDB Index


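• Index field order does not matter

  a field from the middle of the index can be used

  (Caveat: MongoDB's documentation describes a left-to-right prefix rule for compound indexes as well, so check each query with explain().)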

  db.artist.ensureIndex( {"name":1, "numbers": -1 } );


  db.artist.find( { "name": "Snoop Dogg" } ) Works
  db.artist.find( { "socialprofiles.numbers.incoming": { "$gte": 10 } } ) Works
  db.artist.find( {
    "name": "Snoop Dogg",
    "socialprofiles.numbers.incoming": { "$gte": 0 }
  } ) Works




• Index Debugging

  db.artist.find( { "name": "Snoop Dogg" } ).explain()
Tools & Debugging
MongoDB error messages


• Sorted query without a limit:
  Error: "too much data for sort() with no index. add an index or
  specify a smaller limit"

  Fix: add the sort field to an index



• Duplicate key error:
  Problem: in older versions (< 1.6.0) the database crashes after too many
  duplicate key errors

  Fix: use an upsert
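
Why an upsert avoids duplicate key errors can be shown with a tiny in-memory stand-in for a collection with a unique `_id`. The `TinyCollection` class is hypothetical and only illustrates the semantics; in the mongo shell the corresponding call is an update with the upsert option enabled.

```javascript
// Hypothetical in-memory stand-in for a collection with a unique _id index.
class TinyCollection {
  constructor() {
    this.docs = new Map();
  }

  // Plain insert: fails when the _id already exists.
  insert(doc) {
    if (this.docs.has(doc._id)) throw new Error(`duplicate key: ${doc._id}`);
    this.docs.set(doc._id, { ...doc });
  }

  // Upsert: update the document if it exists, otherwise create it.
  upsert(id, fields) {
    const existing = this.docs.get(id) || { _id: id };
    this.docs.set(id, { ...existing, ...fields });
  }
}

const stream = new TinyCollection();

// The crawler sees the same item twice, e.g. because its like count changed.
stream.upsert("item-1", { message: "new single out now", likes: 1 });
stream.upsert("item-1", { likes: 5 });

console.log(stream.docs.size, stream.docs.get("item-1"));
```

With a plain insert the second write would raise a duplicate key error; with an upsert it becomes an update of the existing item, so a crawler that revisits items never triggers the error.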
db.serverStatus()


How much memory is used, how many connections are open, ...



globalLock           How long collections were locked, ...

connections          How many connections are open / available, ...

backgroundFlushing   When the last flush to disk happened, ...

...                  More information in the documentation:
                     http://www.mongodb.org/display/DOCS/Monitoring+and+Diagnostics
Profiling



db.setProfilingLevel(0)   off

db.setProfilingLevel(1)   log slow operations (>100 ms); the "slow" threshold can
                          optionally be set with db.setProfilingLevel(1, 10)

db.setProfilingLevel(2)   log all operations



system.profile
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media query: { _id: ObjId(4c607ef2a68299079400b5ea) } nscanned:1 moved ", "millis" : 0 }
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media query: { _id: ObjId(4c607ef2a68299079400b468) } nscanned:1 moved ", "millis" : 0 }
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media query: { _id: ObjId(4c607fd9a68299079400c067) } nscanned:1", "millis" : 0 }
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media query: { _id: ObjId(4c607ef2a68299079400b6a6) } nscanned:1 moved ", "millis" : 0 }
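
Entries in this format can be summarised with a few lines of JavaScript. The sketch below parses the `info` string of two of the entries shown above. The `parseProfileInfo` helper and its output shape are assumptions, and the parsing only covers the layout seen in these sample lines.

```javascript
// Two of the system.profile entries from the slide, as plain objects.
const profileEntries = [
  {
    ts: "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)",
    info: "update musweet.media query: { _id: ObjId(4c607ef2a68299079400b5ea) } nscanned:1 moved ",
    millis: 0,
  },
  {
    ts: "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)",
    info: "update musweet.media query: { _id: ObjId(4c607fd9a68299079400c067) } nscanned:1",
    millis: 0,
  },
];

// Hypothetical helper: pull the operation, namespace, scan count and the
// "moved" flag out of an info string.
function parseProfileInfo(info) {
  const [op, namespace] = info.trim().split(/\s+/);
  const scanned = /nscanned:(\d+)/.exec(info);
  return {
    op,
    namespace,
    nscanned: scanned ? Number(scanned[1]) : null,
    moved: /\bmoved\b/.test(info),
  };
}

const parsed = profileEntries.map((e) => ({ ...parseProfileInfo(e.info), millis: e.millis }));
const movedCount = parsed.filter((p) => p.moved).length;
console.log(parsed, `moved: ${movedCount}`);
```

Here `nscanned:1` indicates that each update located its document by `_id` after examining a single entry, and the `moved` marker flags updates where the document had to be relocated on disk.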
Analyzing collection objects

Download: http://github.com/compuccino/mongodb-ac
In closing...


• Questions?



• More about us:

  http://compuccino.com

  http://facebook.com/compuccino



• People:

  Grischa Andreew, @grischaandreew

  Nader Cserny, @nadr

Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

musweet.com: Handling Humongous Data Sets from the Social Web

  • 1. musweet.com Handling Humongous Data Sets from the Social Web Grischa Andreew & Nader Cserny, compuccino
  • 2. Agenda • Introduction • Technology • Data in Detail • Queries • Tools & Debugging • Questions
  • 3.
  • 10. Statistics: 7,002 artists; 20,848 social media profiles; 3,914,259 stream items; 309,855 media items (audio, video, photo); ~3,450 read queries / sec; ~15,000 inserts / day; 3.75 GB data size
  • 11.
  • 13. What do we need? ARTIST: name, city, genre, image. SOCIAL PROFILE: platform (e.g. twitter), link. MEDIA POSTS: images, videos, audio, status updates. PROFILE INFO: friends, followers, date, websites, profile image, biography, label, etc.
  • 14. MySQL Schema (diagram of tables, columns and indexes)
      Artist: id, name. Index: name
      Genres: id, name. Index: name
      artist_genres: artist, genre
      Socialprofile: id, artist, url
      Numbers: artist, socialprofile, outgoing, incoming, feedback, push
      Stream: artist, socialprofile, message, created_at
      Service Informations Twitter: lang, verified, location, id, url, created_at, description, time_zone, profile_image_url, screen_name
      Service Informations Facebook: category, name, fan_count, bio, url, username, record_label, location, profile_image_url, band_members, website, link, pinnwand_posts, genre, friends, id
      Service Informations Myspace: website, genre, location, art_des_labels, headline, created_at, id, profile_image_url, label
      Stream Informations Facebook / Twitter / Myspace: one table per platform, each referencing stream; columns shown include created_at, name, source, id, link, caption, category, in_reply_to_status_id, in_reply_to_user_id, image, likes, truncated, type, deleted, icon
      Indexes named in the diagram: artist, service, artist_genre, artists_service, artist_outgoing, artist_incoming, artist_feedback, artist_push, message, artist_created_at, stream
  • 15. MongoDB Schema
      Artist
        id (ObjectId)
        name (str)
        genres (strict array)
        socialprofiles (strict array)
          service (dbref)
          url (str)
          numbers (strict array): incoming, outgoing, push, feedback, date
          meta (array) (different fields depending on the platform)
        Indexes: name, genres, socialprofiles.service, socialprofiles.numbers
      Stream
        id (hash of the facebook / myspace / twitter id)
        socialprofile (dbref)
        genres (strict array) (the artist's genres stored redundantly so the stream can be queried by genre directly)
        data (array) (data from the platforms; the field message is required)
        created_at (datetime)
        Indexes: socialprofile, genres, data.message
  • 16. How do we get the data? (simplified)
      Crawler: • Processing of links • Extraction of media • Preparation of the content
      musweet: • Display of the content • Mapping to artist / service
      Exchanged between the two: links, data
  • 17.
  • 19. Artist profile on MySpace
      "numbers" : {
        "outgoing" : 221665,
        "incoming" : 770355,
        "feedback" : 36862603,
        "push" : 0,
        "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)"
      },
      "meta" : {
        "website" : "http://www.snoopdogg.com",
        "genre" : "Hip Hop / Rap / R&B",
        "location" : "Long Beach, California Vereinigte Staaten von Amerika",
        "art_des_labels" : "Major",
        "headline" : "",
        "created_at" : "Sat Dec 11 2004 01:00:00 GMT+0100 (CET)",
        "id" : 6344278,
        "profile_image_url" : "http://c1.ac-images.myspacecdn.com/images02/130/m_9857dcca155247b69e1260e6e34cce3c.jpg",
        "label" : "Doggystyle / Priority"
      }
  • 20. Artist profile on twitter
      "numbers" : {
        "outgoing" : 1204,
        "incoming" : 2030350,
        "feedback" : 22750,
        "push" : 3145,
        "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)"
      },
      "meta" : {
        "lang" : "en",
        "verified" : true,
        "location" : "LBC",
        "id" : 3004231,
        "url" : "http://www.snoopdogg.com",
        "created_at" : "Fri Mar 30 2007 21:05:42 GMT+0200 (CEST)",
        "description" : "More Malice CD + DVD IN STORES NOW",
        "time_zone" : "Pacific Time (US & Canada)",
        "profile_image_url" : "http://a3.twimg.com/profile_images/1096549203/snoop_normal.jpg",
        "screen_name" : "SnoopDogg"
      }
  • 21. Artist profile on facebook
      "numbers" : {
        "outgoing" : 0,
        "incoming" : 2930860,
        "feedback" : 0,
        "push" : 0
      },
      "meta" : {
        "category" : "Musicians",
        "name" : "Snoop Dogg",
        "fan_count" : 2930860,
        "bio" : "The offices at the top of the Capitol Records building in Hollywood are home to some of Southern California’s most awe-inspiring views. ....",
        "url" : "http://www.facebook.com/snoopdogg?v=info",
        "username" : "snoopdogg",
        "record_label" : "Priority/Doggystyle ",
        "location" : "Long Beach, CA",
        "profile_image_url" : "http://profile.ak.fbcdn.net/hprofile-ak-snc4/hs622.snc3/27524_11455644806_1192_s.jpg",
        "band_members" : "Snoop Dogg",
        "website" : [ "http://www.snoopdogg.com", "http://www.myspace.com/snoopdogg", "http://twitter.com/snoopdogg" ],
        "link" : "http://www.facebook.com/snoopdogg",
        "pinnwand_posts" : 0,
        "genre" : "Hip Hop / Rap / R&B",
        "friends" : 0,
        "id" : "11455644806"
      }
  • 22.
  • 24. MySQL vs. MongoDB (1): all social media profiles with follower counts for one artist
      MySQL (duration: 0.0288 s):
        SELECT n.incoming, a.id AS artist, a.name AS artist_name, s.id AS socialprofile, s.url AS socialprofile_url
        FROM numbers AS n
        JOIN socialprofile AS s ON s.id = n.socialprofile
        JOIN artist AS a ON a.id = n.artist
        WHERE a.name = "Snoop Dogg"
        ORDER BY n.incoming DESC
      MongoDB (duration: 0.0001 s):
        db.artist.find( { "name": "Snoop Dogg" } )
  • 25. MySQL vs. MongoDB (2): the 10 hip-hop musicians with the most followers
      MySQL (duration: 0.8741 s):
        SELECT n.incoming, a.id AS artist, a.name AS artist_name, s.id AS socialprofile, s.url AS socialprofile_url
        FROM numbers AS n
        JOIN artist_genres AS ag ON ag.artists = n.artist
        JOIN genres AS g ON g.id = ag.genre
        JOIN socialprofile AS s ON s.id = n.socialprofile
        JOIN artist AS a ON a.id = n.artist
        WHERE g.name = "Hip/Hop"
        ORDER BY n.incoming DESC
        LIMIT 10
      MongoDB (duration: 0.0230 s):
        db.artist.find( { "genre": DBRef("genre","hiphop") } ).sort( { "socialprofiles.numbers.incoming": -1 } ).limit(10)
  • 26. MySQL Index
      • The index is read from left to right, so the order of the fields matters
        Fields: "artist", "incoming", "push", "date"
        SELECT * FROM numbers WHERE artist = 1    (works)
        SELECT * FROM numbers WHERE incoming = 1    (does not work)
        SELECT * FROM numbers WHERE artist = 1 AND push < 10    (does not work)
        SELECT * FROM numbers WHERE artist = 1 AND push < 10 AND incoming > 0    (works)
      • Index debugging
        EXPLAIN SELECT * FROM numbers WHERE artist = 1
  • 27. MongoDB Index
      • The order of the fields in the index does not matter; a field in the middle of the index can be used
        db.artist.ensureIndex( { "name": 1, "numbers": -1 } );
        db.artist.find( { "name": "Snoop Dogg" } )    (works)
        db.artist.find( { "socialprofiles.numbers.incoming": { "$gte": 10 } } )    (works)
        db.artist.find( { "name": "Snoop Dogg", "socialprofiles.numbers.incoming": { "$gte": 0 } } )    (works)
      • Index debugging
        db.artist.find( { "name": "Snoop Dogg" } ).explain()
  • 28.
  • 30. MongoDB error messages
      • Sorted query without a limit
        Error: "too much data for sort() with no index. add an index or specify a smaller limit"
        Solution: add the sort field to the index
      • Duplicate Key Error
        Error: in older versions (< 1.6.0) the DB crashes when there are too many duplicate key errors
        Solution: use an upsert
  • 31. db.serverStatus(): how much memory is used, how many connections, ...
      globalLock: how long collections were locked, ...
      connections: how many connections are open / available, ...
      backgroundFlushing: when the last flush to disk happened, ...
      ...
      More information in the documentation: http://www.mongodb.org/display/DOCS/Monitoring+and+Diagnostics
  • 32. Profiling
      db.setProfilingLevel(0): off
      db.setProfilingLevel(1): log slow operations (>100ms); optionally define "slow" with db.setProfilingLevel(1, 10)
      db.setProfilingLevel(2): log all operations
      system.profile:
        { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b5ea) } nscanned:1 moved ", "millis" : 0 }
        { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b468) } nscanned:1 moved ", "millis" : 0 }
        { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607fd9a68299079400c067) } nscanned:1", "millis" : 0 }
        { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b6a6) } nscanned:1 moved ", "millis" : 0 }
  • 33. Analyzing collection objects. Download: http://github.com/compuccino/mongodb-ac
  • 34.
  • 36. In closing... • Questions? • More about us: http://compuccino.com http://facebook.com/compuccino • People: Grischa Andreew, @grischaandreew; Nader Cserny, @nadr
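
The left-to-right rule from slide 26 can be sketched in a few lines of JavaScript (the language of the MongoDB shell used on the other slides). This is a simplified model and not MySQL's optimizer: `usableIndexColumns` is a hypothetical helper, and it assumes that equality conditions extend the usable index prefix, that the first range condition ends it, and that a column missing from the WHERE clause stops it. In this model the slide's query on `incoming` alone can use no index column at all, and the query on `artist` and `push` can use only `artist`.

```javascript
// Hypothetical helper, not part of MySQL: models the leftmost-prefix rule
// of a composite B-tree index under the assumptions stated above.
// `conditions` maps a column name to the comparison operator used on it.
function usableIndexColumns(indexColumns, conditions) {
  const used = [];
  for (const column of indexColumns) {
    const operator = conditions[column];
    if (operator === undefined) break; // gap: later columns cannot narrow the range
    used.push(column);
    if (operator !== "=") break;       // a range condition ends the usable prefix
  }
  return used;
}

// Column order of the index from slide 26.
const numbersIndex = ["artist", "incoming", "push", "date"];

// The four WHERE clauses from the slide, in the same order.
const slideQueries = [
  { artist: "=" },
  { incoming: "=" },
  { artist: "=", push: "<" },
  { artist: "=", push: "<", incoming: ">" },
];

for (const conditions of slideQueries) {
  console.log(JSON.stringify(conditions), "->", usableIndexColumns(numbersIndex, conditions));
}
```

On a real server, the EXPLAIN statement from the slide is the authoritative way to see which index and how much of it a query uses.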

Editor's Notes

  1. Extensibility and handling of large data volumes in our project musweet.com
  2. What is musweet? Why we chose MongoDB. Comparison between the old system with MySQL and MongoDB. Interesting queries. Which tools & debugging methods we use.
  3. Website about music and its players in the social web; measures and rates online activity in real time; analyzes data sources and presents them; shows photos, music and videos of bands and musicians. Experience gained with MySQL at wahl.de is now applied with MongoDB at musweet.com.
  4. Media stream with link expander (= contained media are displayed directly on the page). Currently we crawl myspace, facebook and twitter -> later extension to blogs and youtube. The stream can be filtered by genre.
  5. Most discussed topics of the last 7 days
  6. Who gained the most friends (Big Mover); who wrote the most messages (Big Shaker); filterable by genre; updated daily.
  7. Master data of an artist. Social media profiles => where the musician is active on the web. Media stream of the musician. Upcoming concerts. Related artists: similar in genre and with similar contact numbers.
  8. Growing data base. Activity from the social web demands high insert performance. Started with well-known artists first, extension later.
  9. We have artists with various social profiles, each of which in turn has different profile / stream information. The stream / profile information should be sortable by the artist's attributes (genres, ...).
  10. For every additional service we need two more tables (profile information, stream); for every additional multi-valued attribute on the artist / social profile we need a join table and a data table (artist -> artists_genres -> genres). Because of the many tables, querying the data is not easy, and every change has to be implemented in both the backend and the frontend.
  11. A drastically reduced schema is possible. A new attribute only requires a new entry in the object (without touching the DB). The changes are implemented in the backend; the frontend does not have to be changed.
  12. The crawler is a standalone application and manages the crawls for several client apps such as musweet.com. musweet.com registers the social profiles with the crawler and receives a push notification when a profile changes or a new message is posted.
  13. The numbers object is fixed and always has the same structure; the meta object is filled with platform-specific data.
  14. For Twitter we have different information than for Myspace. "profile_image_url" denotes the artist's profile picture on the platform.
  15. For Facebook we usually have more information than for the other platforms, depending on the Facebook account type (fan page / user profile).
  16. MySQL: either with JOINs or with 3 SELECTs. MongoDB queries are much simpler and faster.
  17. MySQL: even more JOINs or SELECT statements. MongoDB with a DBRef to the genre.
  18. Many different indexes are necessary => many GB of data
  19. Indexes are more space-efficient and easier to use. MongoDB can have only one multikey field (an array as data) in an index.
  20. Error messages we ran into during development. The "too much data for sort()" error only shows up later, once there is a lot of data in the DB.
  21. globalLock: how long locked; mem: how much memory is used; indexCounters: how many hits, how many misses; connections: how many open, how many available; opcounters: how many inserts, updates, deletes; backgroundFlushing: when the last flush happened
  22. Slow database queries or all queries. Profiling at the database level.
  23. Useful tool for finding out how many different object structures a collection contains and for inspecting their layout.
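
The idea behind the tool in note 23 can be sketched on plain in-memory objects. The tool's own source is not shown in these slides, so this is an independent illustration and not its implementation: `structureOf` and `countStructures` are hypothetical names, and the sample documents are abridged and partly made up, shaped like the twitter and MySpace profiles on the slides (fixed `numbers`, platform-specific `meta`).

```javascript
// Describes one document's shape as a sorted list of "path:type" entries.
// Nested objects are walked recursively; arrays are recorded as "array".
function structureOf(value, prefix = "") {
  const entries = [];
  for (const key of Object.keys(value)) {
    const path = prefix ? `${prefix}.${key}` : key;
    const field = value[key];
    if (Array.isArray(field)) {
      entries.push(`${path}:array`);
    } else if (field !== null && typeof field === "object") {
      entries.push(...structureOf(field, path));
    } else {
      entries.push(`${path}:${field === null ? "null" : typeof field}`);
    }
  }
  return entries.sort();
}

// Counts how many documents share each distinct structure.
function countStructures(documents) {
  const counts = new Map();
  for (const doc of documents) {
    const signature = structureOf(doc).join(", ");
    counts.set(signature, (counts.get(signature) || 0) + 1);
  }
  return counts;
}

// Illustrative sample data; the third document uses invented values.
const profiles = [
  { numbers: { incoming: 2030350, push: 3145 }, meta: { lang: "en", verified: true } },
  { numbers: { incoming: 770355, push: 0 }, meta: { genre: "Hip Hop / Rap / R&B", label: "Doggystyle / Priority" } },
  { numbers: { incoming: 10, push: 1 }, meta: { lang: "de", verified: false } },
];

for (const [signature, count] of countStructures(profiles)) {
  console.log(`${count} x ${signature}`);
}
```

Because the signature ignores values and key order, the two twitter-shaped documents fall into one group and the MySpace-shaped document into another, which is the kind of overview the note describes.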