Nosql part3
Weekend Business Analytics Praxis


© All Rights Reserved

Nosql part3: Presentation Transcript

• NoSQL & MongoDB, Part III (Arindam Chatterjee)
• Aggregation in MongoDB
  – Aggregations are operations that process data records and return computed results.
  – MongoDB provides a rich set of aggregation operations that examine and perform calculations on data sets.
  – Running data aggregation on the mongod instance simplifies application code and limits resource requirements.
  – Like queries, aggregation operations in MongoDB take a collection of documents as input and return results in the form of one or more documents.
  – In MongoDB, aggregations are implemented using:
    – Aggregation Pipeline
    – Map-Reduce
    • Aggregation Pipeline
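The slide above likely carried a pipeline diagram. As a hedged sketch of what a two-stage pipeline computes, here is a plain-JavaScript emulation of $match followed by $group over a small in-memory array; the sample documents are hypothetical, not from the deck.

```javascript
// Plain-JavaScript emulation of a two-stage aggregation pipeline:
// $match (filter documents) followed by $group (sum a field per key).
const orders = [
  { cust_id: "abc123", status: "A", price: 25 },
  { cust_id: "abc123", status: "A", price: 30 },
  { cust_id: "xyz789", status: "D", price: 10 },
];

// Stage 1: $match — keep only documents whose status is "A"
const matched = orders.filter(doc => doc.status === "A");

// Stage 2: $group — sum price per cust_id
const grouped = {};
for (const doc of matched) {
  grouped[doc.cust_id] = (grouped[doc.cust_id] || 0) + doc.price;
}

console.log(grouped); // { abc123: 55 }
```

In the mongo shell the equivalent would be `db.orders.aggregate([{ $match: { status: "A" } }, { $group: { _id: "$cust_id", total: { $sum: "$price" } } }])`.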
• Map Reduce
  – MongoDB applies the map phase to each input document (i.e. the documents in the collection that match the query condition).
  – The map function emits key-value pairs.
  – For keys that have multiple values, MongoDB applies the reduce phase, which collects and condenses the aggregated data.
  – MongoDB then stores the results in a collection.
  – MongoDB supports sharded collections as both input and output.
    • Map Reduce Illustration
    • Map Reduce
• Map Reduce..more example
  – Insert data in collection "orders" as follows:
    db.orders.insert({
      _id: ObjectId("50a8240b927d5d8b5891743c"),
      cust_id: "abc123",
      ord_date: new Date("Oct 04, 2012"),
      status: 'A',
      price: 25,
      items: [ { sku: "mmm", qty: 5, price: 2.5 },
               { sku: "nnn", qty: 5, price: 2.5 } ]
    });
  – Task: find the total price per customer.
  – Step I: define a map function that emits a (cust_id, price) pair:
    var mapFunction1 = function() { emit(this.cust_id, this.price); };
• Map Reduce..more example..2
  – Define a reduce function with two arguments, keyCustId and valuesPrices:
    – valuesPrices is an array whose elements are the price values emitted by the map function, grouped by keyCustId.
    – The function reduces the valuesPrices array to the sum of its elements.
    var reduceFunction1 = function(keyCustId, valuesPrices) {
      return Array.sum(valuesPrices);
    };
  – Perform the map-reduce on all documents in the orders collection using mapFunction1 and reduceFunction1:
    db.orders.mapReduce(
      mapFunction1,
      reduceFunction1,
      { out: "map_reduce_example" }
    )
  – Do a find() to check the new collection "map_reduce_example":
    db.map_reduce_example.find();
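The map and reduce phases above can be emulated in plain JavaScript without a server. This sketch uses the same logic as mapFunction1 and reduceFunction1 (emit a (cust_id, price) pair per document, then sum each key's value array); the sample documents are hypothetical.

```javascript
// In-memory emulation of the orders map-reduce.
const docs = [
  { cust_id: "abc123", price: 25 },
  { cust_id: "abc123", price: 40 },
  { cust_id: "def456", price: 15 },
];

// Map phase: emit one [key, value] pair per document
const emitted = docs.map(d => [d.cust_id, d.price]);

// Group the emitted values by key
const byKey = {};
for (const [k, v] of emitted) (byKey[k] = byKey[k] || []).push(v);

// Reduce phase: sum each key's value array, as reduceFunction1 does
const mapReduceExample = Object.fromEntries(
  Object.entries(byKey).map(([k, vals]) => [k, vals.reduce((a, b) => a + b, 0)])
);

console.log(mapReduceExample); // { abc123: 65, def456: 15 }
```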
• Full Text Search in MongoDB
  – Important concepts:
    – Stop Words: words that are filtered out because they are irrelevant for searching, e.g. is, at, the, am, I, your.
    – Stemming: the process of reducing words to their root (base) form, e.g. "waiting", "waited", and "waits" all have the root "wait".
  – Example: "I am your father, Luke"
    – "I", "am", "your" are stop words.
    – After removing the stop words, the words left are "father" and "Luke".
    – These are processed in the next step.
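A minimal sketch of the two concepts above, in plain JavaScript. The stop-word list and the suffix-stripping rule here are illustrative assumptions, not MongoDB's actual linguistic tables.

```javascript
// Toy stop-word filtering and rule-based stemming.
const STOP_WORDS = new Set(["i", "am", "your", "is", "at", "the"]);

function tokenize(text) {
  // lowercase and split into alphabetic tokens
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

function stem(word) {
  // crude suffix stripping: waiting / waited / waits -> wait
  return word.replace(/(ing|ed|s)$/, "");
}

const terms = tokenize("I am your father, Luke")
  .filter(w => !STOP_WORDS.has(w))
  .map(stem);

console.log(terms); // [ 'father', 'luke' ]
```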
• Text Search process in MongoDB
  – Tokenizes and stems the search term(s) during both index creation and text command execution.
  – Assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.
  – By default, the text command returns at most the top 100 matching documents, as determined by the scores.
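MongoDB's real scoring formula weights fields and term statistics internally; as a hedged illustration of the scoring-and-ranking idea only, this sketch counts occurrences of a search term in each document's indexed field and returns matches sorted by score. The documents and the scoring rule are hypothetical.

```javascript
// Toy relevance scoring: term-occurrence count per document,
// matches filtered and sorted descending by score.
const textDocs = [
  { _id: 1, txt: "I am your father, Luke" },
  { _id: 2, txt: "father and son" },
  { _id: 3, txt: "no match here" },
];

function score(doc, term) {
  const words = doc.txt.toLowerCase().match(/[a-z]+/g) || [];
  return words.filter(w => w === term).length;
}

const results = textDocs
  .map(d => ({ _id: d._id, score: score(d, "father") }))
  .filter(r => r.score > 0)
  .sort((a, b) => b.score - a.score);

console.log(results); // two matching documents, each with score 1
```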
• Full Text Search in MongoDB..Example
  – While starting the MongoDB server, use the following parameter:
    mongod --setParameter textSearchEnabled=true
  – Create a text index on collection "txt":
    db.txt.ensureIndex( { txt: "text" } )
  – To show the text index, use:
    db.txt.getIndices()
  – Insert data in collection "txt":
    db.txt.insert( { txt: "I am your father, Luke" } )
  – Stop-word filtering has already happened; the following command shows only 2 keys in the index txt.txt.$txt_text:
    db.txt.validate()
  – Perform a full text search using:
    db.txt.runCommand( "text", { search: "father" } )
    • Text Analytics
• What is Text Analytics
  – The process of identifying meaningful information from unstructured content.
  – Social Media Analytics (Facebook, Twitter):
    – What do people feel about the latest movie?
    – What is our competitor doing in the market?
    – What is the response to the last ad campaign?
    – What is the sentiment of people in the organization?
    – What are people feeling about the new brand of product?
• Text Analytics..2
  – Email Analytics
    – Customer Support
    – Regulatory Compliance
  – Log Analytics
    – IT Server Log
• Text Analytics..3
  – Fraud Detection Analytics
    – Insurance Claims
    – Credit Card Transactions
    – Tax Return Claims
• Text Analytics: Scenarios
  – Obtain reviews about a new movie from various blogs and review sites.
  – Highlight important viewer comments on the movie.
  – In the process, the Text Analytics engine performs the following:
    – Understands human language.
    – Distinguishes positive vs. negative comments.
    – Identifies sarcasm, criticism, and puns.
    – Tries to interpret like a human being.
• Sentiment Analysis of the movie Krrish 3
  – Krrish 3 (2013), Hindi (U), 152 min, Action; released November 2013 (India).
  – Rating: 6.5/10 from 6,762 users; 135 user reviews, 26 critic reviews.
  – Plot: Krrish and his scientist father have to save the world and their own family from an evil man named Kaal and his team of human-animal mutants led by the ruthless Kaya. Will they succeed? How?
  – Director: Rakesh Roshan. Writers: Robin Bhatt (screenplay), Honey Irani (screenplay), 5 more credits. Stars: Priyanka Chopra, Hrithik Roshan, Amitabh Bachchan.
  – Sample review titles:
    – "Wish I were 12 again" (shahin mahmud, 1 November 2013)
    – "Plagiarism..Plagiarism... Everywhere" (venugopal19196 from Guntur, 2 November 2013)
    – "Krrish ek soch hain jo hum tak nahi pahunch paye" ("Krrish is an idea that never reached us") (darkshadowsxtreme from India, 4 November 2013)
    – "Far below expectations" (Arpan Mallik from India, 3 November 2013)
    – "Krrish 3: No more than a mere rubbish.." (amruthvvkp from India, 3 November 2013)
• Text Analytics: Information Extraction
  – Distill structured data from unstructured and semi-structured text.
  – Exploit the extracted data in your applications.
  – Example (unstructured content → text extraction engine with extraction logic → structured content):
    – Nouns: Krrish 3, Rakesh Roshan, Priyanka Chopra, Hrithik Roshan, Amitabh Bachchan, Robin Bhatt, Honey Irani
    – Adjectives: good, worst, more, below
    – Comments: "Krrish ek soch hain jo hum tak nahi pahunch paye", "rubbish", "plagiarism"
• Text Analytics: Information Extraction..2
  – Pattern Recognition: phone numbers, date formats, email addresses, URLs
  – Entities and Relations: person, location, organization, associations between entities
  – Linguistic Annotation: tokenization, parts of speech, normalization, co-reference resolution
  – Others: topic identification, sentiment/opinion, classification, ontology
• Text Analytics Terminology
  – RegEx: a regular expression used to recognize patterns of text, e.g. phone numbers.
  – Dictionaries: lists of entries containing domain-specific terms, e.g. a dictionary of city names or of IT companies.
  – Text Extraction Script: a script that applies dictionaries and regexes to a set of text documents and performs the extraction, e.g. a GATE extractor program.
  – Annotation: a labeled piece of text matching a particular criterion, e.g. a person name.
  – Precision: a measure of the exactness (accuracy) of a pattern-recognition program.
  – Recall: a measure of its completeness.
  – The higher the precision and recall, the better the program.
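Precision and recall as defined above reduce to two small formulas over true positives (correct annotations), false positives (spurious ones), and false negatives (missed ones). The counts in this sketch are hypothetical.

```javascript
// precision = TP / (TP + FP): how many extracted annotations are correct
// recall    = TP / (TP + FN): how many true annotations were found
function precision(tp, fp) { return tp / (tp + fp); }
function recall(tp, fn) { return tp / (tp + fn); }

// Hypothetical run: the extractor emitted 8 person names, 6 of them
// correct (so 2 false positives), and missed 2 real names (false negatives).
const p = precision(6, 2); // 6 correct out of 8 extracted = 0.75
const r = recall(6, 2);    // 6 found out of 8 actually present = 0.75
console.log(p, r);
```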
• Text Analytics Approaches
  – Grammar based:
    – Input text is viewed as a sequence of tokens.
    – Rules are expressed as regular-expression patterns over these tokens.
  – Algebra based:
    – Extract spans matching a dictionary or regex.
    – Create an operator for each basic operation.
    – Compose operators to build complex extractors.
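The grammar-based approach can be sketched with a single regular-expression rule. The pattern below matches one common style of US phone number and is illustrative only; production extractors layer many such rules over tokenized text.

```javascript
// Grammar-based extraction sketch: a regex pattern for phone numbers
// applied directly to raw text (a simplification of token-level rules).
const PHONE = /\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}/g;

const text = "Call 555-867-5309 or (212) 555-0198 for details.";
const spans = text.match(PHONE) || [];

console.log(spans); // [ '555-867-5309', '(212) 555-0198' ]
```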
• MongoDB as Analytics Platform
  – The flexibility of MongoDB makes it well suited for storing analytics.
  – Customers run different types of analytics engines on the MongoDB platform, such as:
    – usage metrics,
    – business-domain-specific metrics,
    – financial platforms.
  – The most generic metrics that most clients start tracking are events (e.g. "how many people walked into my stores" or "how many people opened an iPhone application").
  – The queries supporting these questions should be efficient in a distributed environment.
• MongoDB as Analytics Platform…2
  – Example: insert data as follows:
    {
      store_id: ObjectId(),   // object id of a store
      event: "door open",     // one of "door opened", "sale made", or "phone calls"
      created_at: new Date("2013-01-29T08:43:00Z")
    }
  – To query on event, store_id, and created_at, run:
    db.events.find({
      store_id: ObjectId("aaa"),
      created_at: { $gte: new Date("2013-01-29T00:00:00Z"),
                    $lte: new Date("2013-01-30T00:00:00Z") }
    })
  – The above query runs fast in a local environment but is painfully slow in a distributed environment with a large database.
  – Multiple compound indexes are created to increase speed:
    db.events.ensureIndex({ store_id: 1, created_at: 1 })
    db.events.ensureIndex({ event: 1, created_at: 1 })
    db.events.ensureIndex({ store_id: 1, event: 1, created_at: 1 })
• MongoDB as Analytics Platform…2
  – Achieving optimization:
    – Each of the indexes should fit into RAM.
    – Any new document will have a seemingly randomly chosen store_id, so an insert has a high probability of landing in the middle of an index.
    – To minimize RAM usage, it is best to insert sequentially, termed "writing to the right side of the index": any new key is greater than or equal to the previous index key.
• MongoDB as Analytics Platform…3
  – Achieving optimization using a "time bucket":
    – Create a time_bucket attribute that breaks the acceptable date ranges down to hour, day, week, month, quarter, and/or year:
      {
        store_id: ObjectId(),   // object id of a store
        event: "door open",
        created_at: new Date("2013-01-29T08:43:00Z"),
        time_bucket: [
          "2013-01-29 08-hour",
          "2013-01-29-day",
          "2013-04-week",
          "2013-01-month",
          "2013-01-quarter",
          "2013-year"
        ]
      }
    – Create the following indexes:
      db.events.ensureIndex({ time_bucket: 1, store_id: 1, event: 1 })
      db.events.ensureIndex({ time_bucket: 1, event: 1 })
    – Instead of running the query on the entire range, run:
      db.events.find({ store_id: ObjectId("aaa"), "time_bucket": "2013-01-29-day" })
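The time_bucket array above can be generated at insert time from the event date. This is a hedged sketch of one way to build it: the label formats mirror the slide, the week bucket is omitted (ISO week numbering has edge cases), and the function name is hypothetical.

```javascript
// Build hour / day / month / quarter / year bucket labels from a Date,
// using UTC fields so labels are timezone-independent.
function timeBuckets(date) {
  const pad = n => String(n).padStart(2, "0");
  const y = date.getUTCFullYear();
  const m = pad(date.getUTCMonth() + 1);
  const d = pad(date.getUTCDate());
  const q = Math.floor(date.getUTCMonth() / 3) + 1; // 1..4
  return [
    `${y}-${m}-${d} ${pad(date.getUTCHours())}-hour`,
    `${y}-${m}-${d}-day`,
    `${y}-${m}-month`,
    `${y}-${pad(q)}-quarter`,
    `${y}-year`,
  ];
}

const buckets = timeBuckets(new Date("2013-01-29T08:43:00Z"));
console.log(buckets);
// [ '2013-01-29 08-hour', '2013-01-29-day', '2013-01-month',
//   '2013-01-quarter', '2013-year' ]
```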
• MongoDB as Analytics Platform…4
  – Benefits of the "time bucket":
    – Using the optimized time_bucket, new documents are added to the right side of the index: any inserted document has a greater time_bucket value than the previous documents.
    – By adding to the right side of the index and querying on time_bucket, MongoDB can swap rarely accessed older documents to disk, resulting in minimal RAM usage.
    – The "hot data" will be the most recently accessed (typically 1-3 months for most analytics applications), and the older data settles nicely to disk.
    – Neither queries nor inserts access the middle of the index, so older index chunks can swap to disk.
    • Thank You