Architecting Wide-ranging Analytical Solutions with MongoDB (#MDBW16)
5. #MDBW16
How to Drive More Value From Data?
Light bulb image from: http://smallbusinessbc.ca/article/five-ways-discover-additional-value-your-business/business-value-idea/
6. #MDBW16
So Many Options
Part of image from: http://mattturck.com/wp-content/uploads/2016/01/matt_turck_big_data_landscape_full.png
7. #MDBW16
Why Are Analytics Important?
From http://www.bain.com/publications/capability-insights/advanced-analytics.aspx
8. #MDBW16
What Criteria To Consider For Choosing Technology
• Assumption: you have identified which derived data/analytics have ROI
• Criteria
• Operations on data (read/write, transform, aggregation, algorithm)
• Time SLA – both how up-to-date data is and response times
• Effort (training, development, management)
• Processing model for analytic (partitionable, iterative, streaming, etc.)
• Cost (data duplication, memory, servers, software)
10. #MDBW16
MongoDB Capabilities to Highlight for Analytics
Community/Open Source
1. Aggregation Framework
2. Reading from secondaries (priority = votes = 0 recommended; see the read-preference sketch below)
3. Mongo Connector – replication to other MongoDB, search engines, etc.
4. Hadoop Connector – exposes MongoDB as native input/output for Hive, Pig, MR, etc.
5. Spark Connector – exposes MongoDB as an RDD/DataFrame/DataSet for read/write
Enterprise Advanced
1. In-memory storage engine – now GA for production use
2. BI Connector – BI & SQL read access to MongoDB
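To make "Reading from secondaries" (item 2 above) concrete, here is a minimal sketch, not from the deck, of directing analytical reads at secondaries with the MongoDB Scala driver; the database and collection names are hypothetical.

import com.mongodb.ReadPreference
import org.mongodb.scala._

val client = MongoClient("mongodb://localhost:27017")
// Reads through this handle prefer secondaries, keeping analytical scans off the primary
val ordersForAnalytics = client.getDatabase("shop")
  .getCollection("orders")
  .withReadPreference(ReadPreference.secondaryPreferred())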
12. #MDBW16
Aggregation Pipeline Stages
• $match – Filter documents
• $geoNear – Geospherical query
• $project – Reshape documents
• $lookup – New – Left-outer joins
• $unwind – Expand arrays in documents
• $group – Summarize documents
• $sample – New – Randomly selects a subset of documents
• $sort – Order documents
• $skip – Jump over a number of documents
• $limit – Limit number of documents
• $redact – Restrict documents
• $out – Sends results to a new collection
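As an illustration of chaining these stages, here is a minimal sketch using the MongoDB Scala driver's aggregation builders; the orders collection and its custId/amount/status fields are hypothetical, not from the deck.

import org.mongodb.scala._
import org.mongodb.scala.model.Aggregates.{filter, group, sort, limit}
import org.mongodb.scala.model.{Accumulators, Filters, Sorts}
import scala.concurrent.Await
import scala.concurrent.duration._

object AggPipelineSketch extends App {
  val client = MongoClient("mongodb://localhost:27017")
  val orders = client.getDatabase("shop").getCollection("orders")

  // $match -> $group -> $sort -> $limit: top 10 customers by completed-order spend
  val pipeline = Seq(
    filter(Filters.equal("status", "complete")),             // $match (named filter in the Scala driver)
    group("$custId", Accumulators.sum("total", "$amount")),  // $group
    sort(Sorts.descending("total")),                         // $sort
    limit(10)                                                // $limit
  )

  Await.result(orders.aggregate(pipeline).toFuture(), 30.seconds).foreach(println)
  client.close()
}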
13. #MDBW16
Aggregation With a Sharded Database
Workload split between shards
1. Client works through mongos as with any query
2. Shards execute pipeline up to a point
3. A single shard merges cursors and continues processing
4. $lookup & $out performed within the primary shard for the database
16. #MDBW16
On-Demand Analytics with Agg FW
Benefits
1. Up-to-date data
2. One technology
3. Only raw data stored
4. Flexible
Tradeoff
1. Slow if scanning many documents
Common Uses
Groups, counts, sums, averages for small subsets of data
[Diagram: the application sends an agg pipeline to the Aggregation Framework at runtime and gets results back in real time]
17. #MDBW16
Offline Analytics With Aggregation Framework
Benefits
1. One technology
2. Can filter at DB on aggregations
3. Low latency (in C++)
Tradeoffs
1. Storing additional data
2. One thread per server/instance
3. Advanced functions not included
Common Uses
1. Pre-calculating values across dataset
2. Batch transformations
[Diagram: the application sends an agg pipeline* ending in $out: "results"; the Aggregation Framework writes the results collection, and data can also be returned to the application]
* MapReduce is also possible but slower (runs in JavaScript), and most requirements can be handled in the agg fw. To output to a sharded collection with the agg fw, results are returned to the driver and written from there to the sharded collection.
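A minimal sketch of this pattern, again with the MongoDB Scala driver and hypothetical collection/field names: a batch job pre-computes per-customer totals, and $out materializes them in a results collection that the application reads instead of re-scanning raw data.

import org.mongodb.scala._
import org.mongodb.scala.model.Aggregates.{group, out}
import org.mongodb.scala.model.Accumulators
import scala.concurrent.Await
import scala.concurrent.duration._

object OfflineAggSketch extends App {
  val client = MongoClient("mongodb://localhost:27017")
  val orders = client.getDatabase("shop").getCollection("orders")

  // Pre-calculate totals across the dataset and materialize them with $out
  val batchPipeline = Seq(
    group("$custId",
      Accumulators.sum("total", "$amount"),
      Accumulators.sum("orderCount", 1)),
    out("customer_totals")   // $out: results land in the "customer_totals" collection
  )
  Await.result(orders.aggregate(batchPipeline).toFuture(), 10.minutes)
  client.close()
}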
18. #MDBW16
Microsharding for Highly Parallel Processing
Benefits
1. Multiple threads for agg fw query per server
2. One technology
Tradeoffs
1. # of parallel threads and partitions in DB predefined
2. No native job scheduling or resource management
Common Uses
Analytics on large result sets to minimize latency
[Diagram: the application sends an agg pipeline through mongos; it runs in parallel on N partitions (multiple shards per server), and data is returned in parallel]
20. #MDBW16
Analytics in Custom Application/Framework
Benefits
1. Flexible & in the app team's control
2. All language libraries & frameworks available
3. Tailing the oplog gives near real-time
Tradeoffs
1. Data might not fit in memory
2. Threading managed by developer
Common Uses
1. Statistical analysis w/ R, Matlab, etc.
2. Advanced analytics & algos
3. Updating counts & aggregations
[Diagram: the application queries raw data and gets results in real time; analyzed data can optionally be stored back in the DB, and a tailable cursor can be used for tracking events (see the sketch below)]
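The "tailable cursor for tracking events" idea could look roughly like this with the Scala driver; it assumes a capped collection named events (hypothetical) and simply prints each new document as it arrives.

import com.mongodb.CursorType
import org.mongodb.scala._

object TailEventsSketch extends App {
  val client = MongoClient("mongodb://localhost:27017")
  val events = client.getDatabase("shop").getCollection("events") // must be a capped collection

  events.find()
    .cursorType(CursorType.TailableAwait)   // cursor stays open and waits for new documents
    .subscribe(new Observer[Document] {
      override def onNext(doc: Document): Unit = println(s"event: ${doc.toJson()}")
      override def onError(e: Throwable): Unit = e.printStackTrace()
      override def onComplete(): Unit = println("cursor closed")
    })

  Thread.sleep(Long.MaxValue) // keep the demo process alive while the cursor tails
}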
21. #MDBW16
Analytics in 3rd-Party Products
Benefits
1. Pre-built UI and toolkits
2. Supports almost all 3rd-party SQL-based tools
3. Can migrate to MongoDB & keep reporting tools
Tradeoffs
1. Optimal performance often requires configuring views
2. Joins between 2 sharded collections can be slow
Common Products
1. Pentaho, Jaspersoft, Alteryx
2. Tableau, Qlikview
[Diagram: a BI or other analytics product issues a SQL query to the MongoDB BI Connector, which translates it into a MongoDB query and returns SQL result sets; native integrations query MongoDB directly and get documents returned]
24. #MDBW16
Partitionable Distributed Analytics
Benefits
1. Very parallelizable to scale horizontally
2. Intermediate results can be on disk, not necessarily memory
Tradeoff
1. Often significant overhead in learning the framework
Common Frameworks
1. Hadoop
2. Spark
[Diagram: a master distributes work to workers, each worker talking to its own mongos, with partitions lined up between workers & shards]
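As a small illustration of the partition-per-worker layout (not from the deck), the MongoDB Spark Connector exposes the collection as an RDD whose partitions map to slices of the collection; this sketch assumes spark.mongodb.input.uri is set on the Spark configuration.

import com.mongodb.spark.MongoSpark
import org.apache.spark.SparkContext

object PartitionCountSketch extends App {
  val sc = SparkContext.getOrCreate()   // spark.mongodb.input.uri assumed set on the conf
  val rdd = MongoSpark.load(sc)         // MongoRDD[Document]; each Spark partition reads its own slice of the collection
  println(s"documents: ${rdd.count()}, partitions: ${rdd.getNumPartitions}")
}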
26. #MDBW16
Iterative Distributed Analytics
Benefits
1. Great for machine learning
2. Memory-based frameworks can be much faster
Tradeoff
1. Harder overall to speed up with horizontal scaling
Common Framework
1. Spark
[Diagram: master and workers, each worker talking to a mongos; stages of iterations might be partitionable]
28. #MDBW16
Streaming Distributed Analytics
Benefits
1. Analysis on current data
2. Can analyze incrementally to avoid batch windows
3. Can use some frameworks for streaming + batch
Tradeoffs
1. Depends on streaming sources being available
2. Some analytics cannot be calculated incrementally
Common Uses & Frameworks
1. Sentiment analysis
2. Spark Streaming, Storm, Flink, Kafka Streams
[Diagram: event sources feed a stream processing framework, which stores events & analytic results in MongoDB and reads historical or reference data on demand; a tailable cursor can feed a downstream stream processing framework]
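A rough sketch of this pattern with Spark Streaming (DStreams) and the MongoDB Spark Connector; the socket source, the "text" field, and the output collection are hypothetical, and spark.mongodb.output.uri is assumed to be set on the Spark configuration.

import com.mongodb.spark.MongoSpark
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.bson.Document

object StreamToMongoSketch extends App {
  val sc = SparkContext.getOrCreate()             // spark.mongodb.output.uri assumed set on the conf
  val ssc = new StreamingContext(sc, Seconds(10)) // 10-second micro-batches

  ssc.socketTextStream("localhost", 9999)          // stand-in event source
    .map(line => new Document("text", line))       // convert each event to a BSON document
    .foreachRDD(rdd => MongoSpark.save(rdd))       // append each micro-batch of events/analytic results

  ssc.start()
  ssc.awaitTermination()
}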
30. #MDBW16
Recommendation Engine Problem Description
Given users' ratings for some items, how to infer users' ratings for all items
Useful for:
1. Recommendations
2. Cross-sell
3. Accurate targeting
Image from: https://www.mapr.com/ebooks/spark/08-recommendation-engine-spark.html
31. #MDBW16
Alternating Least Squares (ALS) Algo
Image from http://netprophetblog.blogspot.com/2013/10/local-regression.html
2-dimensional
Given f(x) = a·x + b
Can minimize d = Σᵢ (yᵢ − f(xᵢ))²
ALS approach
Fix a and solve for b
Alternate: fix b, and solve for a
ALS can extend to n-dimensional
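For the recommendation use case, ALS factorizes the sparse user–item ratings matrix rather than fitting a line. As general background (not stated on the slide), the regularized objective minimized by implementations such as Spark ML's ALS is

\min_{X,\,Y} \; \sum_{(u,i)\,\mathrm{observed}} \bigl( r_{ui} - x_u^{\top} y_i \bigr)^2 \;+\; \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr)

where x_u and y_i are the learned user and item factor vectors and λ is the regularization weight. With Y fixed, each x_u is an independent regularized least-squares solve (and symmetrically for Y with X fixed); that is the "alternate" step above, and it is why each half-iteration parallelizes cleanly across users or items.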
33. #MDBW16
Architecture of Solution
[Diagram: the Spark Master pushes ALSExampleMongoDB to the Spark Workers; each worker handles its partitions of data as appropriate, and also the shuffle]
• Each worker reads its partition of user ratings for items from MongoDB
• Each worker writes its partition of prediction data back to MongoDB
• On startup, shared libraries are loaded by the workers: 1. MongoDB Spark Connector, 2. Java Driver
Full code for example can be found at:
https://github.com/matthewkalan/mongo-spark-recommender-example
34. #MDBW16
Code for Configuration and Reading from MongoDB

// imports needed by the code on this and the next two slides
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.evaluation.RegressionEvaluator
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.sql._   // enables the .mongo() DataFrame reader
import java.util.Calendar

object ALSExampleMongoDB {
  def main(args: Array[String]): Unit = {
    // this conf should only be used when run locally because sc.getOrCreate() reuses already running SparkContexts
    val sc = SparkContext.getOrCreate()
    val sqlContext = SQLContext.getOrCreate(sc)
    val startTime = Calendar.getInstance().getTime() // start timer; elapsed time is printed at the end
    val inputUri = args(1) // pass MongoDB connection string from args
    // setting up DataFrame to read from MongoDB - Connector automatically partitions the data to spread across workers
    val ratingsAll = sqlContext.read.options(
      Map(
        "uri" -> inputUri
        //"localThreshold" -> "0",              // Add these two parameters to connect to the nearest mongos, if desired
        //"readPreference.name" -> "nearest",
        //"partitionerOptions.partitionSizeMB" -> "128", // Typically partitions should be 64 - 512 MB
        //"partitioner" -> "MongoSamplePartitioner"      // If a custom partitioner is desired
      )).mongo()
    val userIdThreshold = args(3).toInt
    val ratings = ratingsAll.filter(ratingsAll("userId") > userIdThreshold) // Filtering & aggregation pushed down to DB w/ indexes
    // caching the DataFrame in memory of Spark workers
    ratings.cache()
35. #MDBW16
Code for Training ALS Algo and Making Predictions
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2)) //split into a training and test dataset
// Build the recommendation model using ALS on the training data
val als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val model = als.fit(training) //train the model
// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(test)
.withColumn("rating", col("rating").cast(DoubleType))
.withColumn("prediction", col("prediction").cast(DoubleType))
//remove NaN values if a user is not in both the training and test dataset
val predictionsValidUsers = predictions.na.drop("any", Seq("rating", "prediction"))
val evaluator = new RegressionEvaluator()
.setMetricName("rmse")
.setLabelCol("rating")
.setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictionsValidUsers)
36. #MDBW16
Code for Writing Predictions to MongoDB
// store the users' predictions back into MongoDB
val outputUri = args(2)
MongoSpark.save(predictionsValidUsers.write.option("uri", outputUri))
// calculate and print running time in seconds
val endTime = Calendar.getInstance().getTime()
val elapsedTime = (endTime.getTime() - startTime.getTime()) / 1000
println(s"Finished in $elapsedTime seconds")
39. #MDBW16
Start simple, expand as required
1. Aggregation Framework
2. Language libraries
3. 3rd Party Products
4. Distributed Processing Frameworks
Light bulb image from: http://smallbusinessbc.ca/article/five-ways-discover-additional-value-your-business/business-value-idea/
40. #MDBW16
For More Information
Resource – Location
• MongoDB Connector for Spark – github.com/mongodb/mongo-spark
• Spark ALS Recommendation Engine Example – github.com/matthewkalan/mongo-spark-recommender-example
• Blog: Future Big Data Architecture – Delivering on the Data Lake Vision – www.mongodb.com/blog/post/the-future-of-big-data-architecture
• White Paper: Unlocking Operational Intelligence from the Data Lake – www.mongodb.com/collateral/unlocking-operational-intelligence-from-the-data-lake
• Blog: Using MongoDB with Hadoop – www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Free Online Training – university.mongodb.com
• Documentation – docs.mongodb.org
• MongoDB Downloads – mongodb.com/download
Editor's Notes
Explain that I mean a broad definition of analytics: really, any derived data
Addresses: is MongoDB enough? Should I be using other products in addition?
Poll audience for what analytics they are considering
For aggregation operations that run on multiple shards, if the operations do not require running on the database’s primary shard, these operations can route the results to any shard to merge the results and avoid overloading the primary shard for that database. Aggregation operations that require running on the database’s primary shard are the $out stage and $lookup stage.
Note: place before the scenario that deals with this & remove some bullets
Replica set or shards are hidden in the database icon
Application uses a programming language driver to send agg pipeline
Example: total balance or value of customer, total number of posts, esp. for a given entity (i.e. can filter) and NOT for the whole database
Obviously the simplest and most common: get the data you want and then call a library in the application
Example: good for pre-calculating totals and aggregations, e.g. balances, documents, dollar values, etc.
If need to generate bulk reports, could send the data back to the reporting tool
Example: good for longer-running jobs; e.g. a Top 10 bank has a personal in-memory data mart with 2 GB allocated per person for report data (from their 2 PB DW), spread across all shards so it is queried in parallel
Note: Be sure to explain an easily digestible example and point out it is not a common pattern
This can be on each server or you can shard across instances to get parallelism – the main concept here is sharding earlier than otherwise necessary to get parallelism in analytical processing
Previous slides were focused on agg fw
Point this out because some hear analytics and think Hadoop/Spark maybe – but there are many libraries and analytics in Java, Python, R, etc.
If data can be filtered well, the latency should be similar for analytic in app vs. agg fw (difference between C++ and language in use)
Example: Using R, Matlab, and other statistical packages directly against MongoDB
Example are SAS, Tableau, etc. or any tool that is read-only from the DB
Point out could even run the Workers on the same server as each MongoDB node, but have to know in advance how big an instance to use. Having the Worker node separate (and it is stateless) allows the Worker to be sized dynamically depending on the job
Can be combined with partitionable portions of algo so that each iteration is partitionable. Then in-memory and distribution are important