Building Social Analytics Tool with MongoDB - A Developer's Perspective
Agenda
1.  Product Overview
2.  Why MongoDB for us?
3.  Aggregation Queries to the rescue
4.  How JavaScript helped us?
5.  Experiences with Indexes
6.  In-progress use-cases
7.  Tips & Tricks
8.  Demo
About me
Abhishek Tejpaul
Software Developer @ IntelliGrape Software
Loves Grails, Git and Linux
abhishek@intelligrape.com
Product Overview – Information Flow
[Diagram: DataSift, Instagram, Web Crawler1, Web Crawler…, mongoDB]
Product Overview – Results
Why MongoDB for us?
•  Schema-less data. Typical data sources
•  Adding new social platforms in future
•  Needed fast read-write operations
Aggregation Queries – Getting Insights
•  Combination of queries chained together
•  At every stage, we can filter/chain/massage data
Image credit: https://www.openshift.com/blogs/an-overview-of-whats-new-in-mongodb-22
Our use-case (esp. for graphs)
•  Sentiment Analysis
•  Demographic Analysis
•  Article Analysis
•  Plan
    •  Creation of Intelligence tables in advance
•  Reality
    •  On-the-fly analysis using Aggregation queries
How to go about it?
•  Operates on a single collection
•  Think about data you have and insights you want
•  Focus on reducing data size early on
•  $match
•  $project
•  $sort
•  $limit, $skip
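•  Example
db.collectionName.aggregate([
    // Filter first so the later stages handle fewer documents
    { "$match"   : { fieldName : matchingValue } },
    // Keep (or compute) only the fields needed downstream
    { "$project" : { oldOrNewField : fieldValue } },
    // Count documents per distinct value; $group needs an _id key
    { "$group"   : { "_id" : "$oldOrNewField", "sum" : { "$sum" : 1 } } },
    // Largest counts first, then keep the top 20 groups
    { "$sort"    : { "sum" : -1 } },
    { "$limit"   : 20 }
])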
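To make the stage-by-stage flow of the example concrete, here is a rough plain-JavaScript imitation of a match, project, group, sort, limit chain over an in-memory array. It is only a sketch of what each stage does to the data, not MongoDB code; the sample documents, the field names (platform, sentiment) and the helper topSentiments are all hypothetical.

```javascript
// Plain-JavaScript imitation of a $match -> $project -> $group -> $sort -> $limit chain.
// Sample documents and field names are hypothetical, for illustration only.
const posts = [
  { platform: "twitter",   sentiment: "positive" },
  { platform: "twitter",   sentiment: "negative" },
  { platform: "instagram", sentiment: "positive" },
  { platform: "twitter",   sentiment: "positive" },
  { platform: "web",       sentiment: "neutral"  },
];

function topSentiments(docs, platform, limit) {
  // $match: shrink the working set as early as possible
  const matched = docs.filter((d) => d.platform === platform);
  // $project: keep only the field we are going to group on
  const projected = matched.map((d) => ({ sentiment: d.sentiment }));
  // $group with { $sum: 1 }: count documents per sentiment
  const counts = new Map();
  for (const d of projected) {
    counts.set(d.sentiment, (counts.get(d.sentiment) || 0) + 1);
  }
  // $sort by count descending, then $limit
  return [...counts.entries()]
    .map(([_id, sum]) => ({ _id, sum }))
    .sort((a, b) => b.sum - a.sum)
    .slice(0, limit);
}

const result = topSentiments(posts, "twitter", 20);
console.log(result);
```

Filtering in the first step means the grouping loop only ever sees the matching documents, which is the same reason for putting $match at the front of a real pipeline.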
  
	
  
Javascript Capabilities
•  All the programming capabilities of the JavaScript language at your disposal
•  Taking business logic / processing to your data-store
Javascript – Our use-cases
•  Remove garbage data at DB level
•  Twitter wrong results
•  Filtering out STOP keywords
	
  
	
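// Body of a server-side JavaScript function (hence the final return).
// The $set runs once after the loop: setting isIgnored inside the loop would
// flag every article on the first stop word, so later words would match nothing.
var filter = { "isActive" : true, "isIgnored" : { "$ne" : true } };
db.IgnoreList.findOne().stopWords.forEach(function (data) {
    // Pull every topic whose name equals the current stop word
    db.ProcessedArticle.update(
        filter,
        { "$pull" : { "topicOfDiscussion" : { "name" : data } } },
        { "multi" : true }
    );
});
db.ProcessedArticle.update(filter, { "$set" : { "isIgnored" : true } }, { "multi" : true });
return true;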
  
	
  
Javascript – Caveats
•  Takes up read-write locks on the entire database
•  Can be run with the { nolock : true } option

    db.runCommand({
        eval   : <function>,
        args   : <args>,
        nolock : <true/false>
    })

•  Can be replaced by mapreduce in most cases
•  Take it as one-off case
Indexes – Our use-cases
•  dropDups
    {dropDups : true}
•  background
    {background : true}
•  Time to Live
    {expireAfterSeconds : 3600}
•  Compound Indexing
    {key1 : 1, key2 : 1} != {key1 : 1}
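As a rough illustration of the Time to Live option, the sketch below mirrors the rule in plain JavaScript: a document becomes eligible for removal once its indexed date field is older than expireAfterSeconds, and documents whose field is not a date never expire. The helper isExpired and the createdAt field are hypothetical. The server removes expired documents in a periodic background pass, so actual deletion can lag behind the moment of expiry.

```javascript
// Hypothetical helper mirroring the TTL rule; not part of any MongoDB API.
function isExpired(doc, dateField, expireAfterSeconds, now) {
  const stamp = doc[dateField];
  // Documents whose indexed field is not a date are never expired
  if (!(stamp instanceof Date)) return false;
  return now.getTime() - stamp.getTime() >= expireAfterSeconds * 1000;
}

const now = new Date("2013-01-01T12:00:00Z");
const fresh = { createdAt: new Date("2013-01-01T11:30:00Z") }; // 30 minutes old
const stale = { createdAt: new Date("2013-01-01T10:30:00Z") }; // 90 minutes old
const undated = { createdAt: "yesterday" };                    // not a Date

console.log(isExpired(fresh, "createdAt", 3600, now));   // false
console.log(isExpired(stale, "createdAt", 3600, now));   // true
console.log(isExpired(undated, "createdAt", 3600, now)); // false
```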
  	
  
Our current state
•  Faster write operations
    •  Under high data load from different sources
•  Faster read operations
    •  Graph rendering up to 10x quicker
•  Ease of scalability
    •  Though yet to reach there
Work In Progress
•  Full-text search implementation
    •  can be created only on strings or array of strings
    •  db.collectionName.ensureIndex( { fieldName : "text" } )
•  Capped Collections
    •  Widgets for last-run jobs / event log tables
    •  Very fast writes possible
    •  db.createCollection("cName", { capped : true, size : 5242880, max : 5000 } )
    •  size argument is always required
Tips / Tricks – Things we learnt
•  cloneCollection
    •  No more ssh/scp to remote systems
    •  db.runCommand({cloneCollection: <nsCollection>, from: <remote>, query: {}})
    •  db.cloneCollection(from, collectionName, query)
•  db.CollectionName.copyTo
    •  does not copy indexes
•  remove() vs drop()
    •  Can't use remove for capped collections
    •  remove keeps indexes while drop() clears them
    •  To remove all the documents in a collection, use drop()
    •  To remove the better part of a large collection, use JavaScript
•  pretty() find by default
    •  DBQuery.prototype._prettyShell = true (inside your ~/.mongorc.js)
DEMO
I am not a MongoDB expert though :)
Thank You!!

Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Building Social Analytics Tool with MongoDB

  • 1. Building Social Analytics Tool with MongoDB - A Developer's Perspective
  • 2. Agenda
      1. Product Overview
      2. Why MongoDB for us?
      3. Aggregation Queries to the rescue
      4. How Javascript helped us?
      5. Experiences with Indexes
      6. In-progress use-cases
      7. Tips & Tricks
      8. Demo
  • 3. About me
      Abhishek Tejpaul
      Software Developer @ IntelliGrape Software
      Loves Grails, Git and Linux
      abhishek@intelligrape.com
  • 4. Product Overview – Information Flow
      Diagram labels: DataSift, Instagram, Web Crawler1, Web Crawler…, mongoDB
  • 8. Why MongoDB for us?
      • Schema-less data. Typical data sources
      • Adding new social platforms in future
      • Needed fast read-write operations
  • 9. Aggregation Queries – Getting Insights
      • Combination of queries chained together
      • At every stage, we can filter/chain/massage data
      Image credit: https://www.openshift.com/blogs/an-overview-of-whats-new-in-mongodb-22
  • 10. Our use-case (esp. for graphs)
      • Sentiment Analysis
      • Demographic Analysis
      • Article Analysis
      • Plan
          • Creation of Intelligence tables in advance
      • Reality
          • On-the-fly analysis using Aggregation queries
  • 11. How to go about it?
      • Operates on a single collection
      • Think about the data you have and the insights you want
      • Focus on reducing data size early on
      • $match
      • $project
      • $sort
      • $limit, $skip
      • Example:
          db.collectionName.aggregate([
            { "$match"   : { fieldName : matchingValue } },
            { "$project" : { oldOrNewField : fieldValue } },
            { "$group"   : { "_id" : "$oldOrNewField", "sum" : { "$sum" : 1 } } },
            { "$sort"    : { "sum" : -1 } },
            { "$limit"   : 20 }
          ])
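The stage order on slide 11 can be sketched as a small pipeline builder. This is only an illustration: the helper name `buildTopValuesPipeline` and the `sentiment` field are assumptions, not something shown in the talk, while `isActive` and `ProcessedArticle` are borrowed from the later slides.

```javascript
// Hypothetical helper: builds a "top N values of one field" pipeline.
// The stages follow the slide's advice: filter and trim documents early,
// then group, sort and limit.
function buildTopValuesPipeline(matchFilter, groupField, limit) {
  return [
    { $match: matchFilter },                                  // shrink the working set first
    { $project: { [groupField]: 1 } },                        // keep only the field we group on
    { $group: { _id: "$" + groupField, sum: { $sum: 1 } } },  // count documents per value
    { $sort: { sum: -1 } },                                   // most frequent values first
    { $limit: limit }
  ];
}

// "sentiment" is an assumed field name used purely for illustration.
const pipeline = buildTopValuesPipeline({ isActive: true }, "sentiment", 20);
console.log(JSON.stringify(pipeline, null, 2));

// In the mongo shell the array would be passed straight to aggregate():
//   db.ProcessedArticle.aggregate(pipeline)
```

Keeping the pipeline as plain data makes it easy to print and inspect before running it against a real collection.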
  • 12. Javascript Capabilities
      • All the programming capabilities of the Javascript language at your disposal
      • Taking business logic / processing to your data-store
  • 13. Javascript – Our use-cases
      • Remove garbage data at DB level
          • Twitter wrong results
      • Filtering out STOP keywords
          db.IgnoreList.findOne().stopWords.forEach( function(data) {
            db.ProcessedArticle.update(
              { "isActive" : true, "isIgnored" : {"$ne":true} },
              {
                "$pull" : {"topicOfDiscussion" : {"name": data}},
                "$set"  : {"isIgnored" : true}
              },
              { "multi" : true }
            )
          });
          return true
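As a variation on the stop-word loop on slide 13, the same clean-up could probably be expressed as one update that pulls every stop word in a single pass by combining `$pull` with `$in`. This is a sketch, not the approach used in the talk: the helper name `buildStopWordUpdate` and the sample words are assumptions, and it assumes `topicOfDiscussion` is an array of `{ name: ... }` sub-documents, as the original query suggests.

```javascript
// Hypothetical helper: builds the filter, update and options for a
// single multi-document update that removes all stop words at once.
function buildStopWordUpdate(stopWords) {
  return {
    filter: { isActive: true, isIgnored: { $ne: true } },
    update: {
      // Remove every array element whose name is one of the stop words.
      $pull: { topicOfDiscussion: { name: { $in: stopWords } } },
      // Mark the document as processed so it is skipped next time.
      $set: { isIgnored: true }
    },
    options: { multi: true }
  };
}

// Sample words for illustration only; the real list lives in IgnoreList.
const cmd = buildStopWordUpdate(["the", "and", "of"]);
console.log(JSON.stringify(cmd, null, 2));

// In the mongo shell this could be wired up roughly as:
//   var words = db.IgnoreList.findOne().stopWords;
//   var c = buildStopWordUpdate(words);
//   db.ProcessedArticle.update(c.filter, c.update, c.options);
```

Because each document is matched and flagged only once, every stop word is removed before `isIgnored` is set.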
  • 14. Javascript – Caveats
      • Takes up read-write locks on the entire database
      • Can be run with the {nolock : true} option
          db.runCommand({
            eval   : <function>,
            args   : <args>,
            nolock : <true/false>
          })
      • Can be replaced by mapreduce in most cases
      • Take it as a one-off case
  • 15. Indexes – Our use-cases
      • dropDups: {dropDups : true}
      • background: {background : true}
      • Time to Live: {expireAfterSeconds : 3600}
      • Compound Indexing: {key1 : 1, key2 : 1} != {key1 : 1}
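The index options on slide 15 can be laid out side by side as plain data. The field names below (`articleUrl`, `author`, `createdAt`, `source`) are hypothetical and only illustrate where each option would go. Two details are worth keeping in mind: `dropDups` only has an effect together with `unique : true` (and was removed in later MongoDB releases), and a TTL index has to sit on a single date field.

```javascript
// Index definitions as plain data; all field names are assumptions.
const indexSpecs = [
  // Unique index that drops duplicate documents while building.
  { keys: { articleUrl: 1 }, options: { unique: true, dropDups: true } },
  // Built in the background so other operations are not blocked.
  { keys: { author: 1 }, options: { background: true } },
  // TTL index: documents expire one hour after their createdAt date.
  { keys: { createdAt: 1 }, options: { expireAfterSeconds: 3600 } },
  // Compound index: a different index from one on "source" alone.
  { keys: { source: 1, createdAt: -1 }, options: {} }
];

// Print the shell command each definition corresponds to.
indexSpecs.forEach(function (spec) {
  console.log(
    "db.collectionName.ensureIndex(" +
      JSON.stringify(spec.keys) + ", " + JSON.stringify(spec.options) + ")"
  );
});
```

`ensureIndex` matches the shell of the talk's era; newer shells expose the same operation as `createIndex`.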
  • 16. Our current state
      • Faster write operations
          • Under high data load from different sources
      • Faster read operations
          • Graph rendering up to 10x quicker
      • Ease of scalability
          • Though yet to reach there
  • 17. Work In Progress
      • Full-text search implementation
          • can be created only on strings or arrays of strings
          • db.collectionName.ensureIndex( { fieldName : "text" } )
      • Capped Collections
          • Widgets for last-run jobs / event log tables
          • Very fast writes possible
          • db.createCollection("cName", { capped : true, size : 5242880, max : 5000 } )
          • size argument is always required
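The capped-collection options on slide 17 can be sketched as a small helper that enforces the "size is always required" rule. The helper name `cappedOptions` is an assumption made for illustration; the values mirror the slide, where 5242880 bytes is 5 MiB.

```javascript
// Hypothetical helper: builds the options document for a capped collection.
// size (in bytes) is mandatory; max (document count) is optional.
function cappedOptions(sizeMiB, maxDocs) {
  if (!(sizeMiB > 0)) {
    throw new Error("size is required for a capped collection");
  }
  const options = { capped: true, size: sizeMiB * 1024 * 1024 };
  if (maxDocs !== undefined) {
    options.max = maxDocs;
  }
  return options;
}

const opts = cappedOptions(5, 5000);
console.log(JSON.stringify(opts));

// In the mongo shell:
//   db.createCollection("cName", opts)
```

Expressing the size in MiB and converting it in one place avoids hand-typing byte counts such as 5242880.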
  • 18. Tips / Tricks – Things we learnt
      • cloneCollection
          • No more ssh/scp to remote systems
          • db.runCommand({cloneCollection: <nsCollection>, from: <remote>, query: {}})
          • db.cloneCollection(from, collectionName, query)
      • db.collectionName.copyTo
          • does not copy indexes
  • 19. Tips / Tricks – Things we learnt
      • remove() vs drop()
          • Can't use remove() for capped collections
          • remove() keeps indexes while drop() clears them
          • To remove all the documents in a collection, use drop()
          • To remove the better part of a large collection, use javascript
      • pretty() output for find() by default
          • DBQuery.prototype._prettyShell = true (inside your ~/.mongorc.js)
  • 21. I am not a MongoDB expert though :)