Available at: https://github.com/dbsmasters/bdsmasters
The current project is implemented in the context of the course "Big Data Management Systems" taught by Prof. Chatziantoniou in the Department of Management Science and Technology (AUEB). The aim of the project is to familiarize the students with big data management systems such as Hadoop, Redis, MongoDB and Azure Stream Analytics.
The first task of the project is to generate a set of more than one million data points to be used as input for the k-means clustering algorithm. Next, the k-means algorithm is implemented following the MapReduce framework.
In the context of this assignment on MongoDB, queries are designed and executed on a MongoDB collection, simple operations on MongoDB are performed with Python, and MapReduce jobs are designed and executed on a MongoDB collection.
MongoDB Project: Relational databases to Document-Oriented databases
1. mongoDB Project
Relational databases & Document-Oriented databases
Athens University of Economics and Business
Dept. of Management Science and Technology
Prof. Damianos Chatziantoniou
Lamprini Koutsokera | lkoutsokera@gmail.com
Stratos Gounidellis | stratos.gounidellis@gmail.com
BDSMasters
2. SQL Server vs. mongoDB
Property | Microsoft SQL Server | MongoDB
Description | Microsoft’s relational DBMS | One of the most popular document stores
Database model | Relational DBMS | Document store
Implementation language | C++ | C++
Data scheme | yes | schema-free
Triggers | yes | no
Replication methods | yes, depending on the SQL Server edition | Master-slave replication
Partitioning methods | tables can be distributed across several files, sharding through federation | Sharding
References: [1]
5. Required installations
MongoDB installation
1. Determine which MongoDB build you need.
2. Download MongoDB for Windows.
3. Install MongoDB Community Edition.
4. Set up the MongoDB environment.
Required Python packages installation
References: [2]
7. Part 1 - Queries and the Aggregation Pipeline [1]
The prep.js file is loaded into the mongo shell with the load() function.
References: [3]
8. Part 1 - Queries and the Aggregation Pipeline [2]
Query 1: How many students in your database are currently taking at least 1 class (i.e. have a class with a course_status of “In Progress”)?
Query 2: Produce a grouping of the documents that contains the name of each home city and the number of students enrolled from that home city.
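Queries 1 and 2 could be expressed roughly as follows. This is a minimal sketch, not the project's own code: the field names "home_city" and "courses" are assumptions (only course_status and the value “In Progress” appear in the query text), and the sample documents are invented for illustration. The pure-Python functions mirror what the filter and the pipeline are meant to compute.

```python
from collections import Counter

# Assumed document shape: {"home_city": str, "courses": [{"course_status": str, ...}]}
# Query 1 as a filter: dot notation matches if ANY element of "courses" matches.
QUERY1_FILTER = {"courses.course_status": "In Progress"}

# Query 2 as an aggregation pipeline: one output document per home city.
QUERY2_PIPELINE = [
    {"$group": {"_id": "$home_city", "students": {"$sum": 1}}},
]

# Invented sample data, used only to illustrate the expected results.
SAMPLE = [
    {"_id": 1, "home_city": "Athens",
     "courses": [{"course_code": "M102", "course_status": "In Progress"}]},
    {"_id": 2, "home_city": "Athens",
     "courses": [{"course_code": "B201", "course_status": "Complete"}]},
    {"_id": 3, "home_city": "Patras",
     "courses": [{"course_code": "M102", "course_status": "Complete"},
                 {"course_code": "P301", "course_status": "In Progress"}]},
]


def count_in_progress(docs):
    """Count students with at least one course that is in progress."""
    return sum(
        1 for d in docs
        if any(c.get("course_status") == "In Progress" for c in d.get("courses", []))
    )


def students_per_city(docs):
    """Group students by home city and count them."""
    return dict(Counter(d["home_city"] for d in docs))
```

With pymongo, the same definitions would typically be passed to `collection.count_documents(QUERY1_FILTER)` and `collection.aggregate(QUERY2_PIPELINE)`.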
9. Part 1 - Queries and the Aggregation Pipeline [3]
Query 3: Which hobby or hobbies are the most popular?
Results shown: the most popular hobby, and the top 5 most popular hobbies.
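One way to approach Query 3 is to unwind the hobbies array, count each hobby, and sort by count. The sketch below assumes an array field named "hobbies" (the name is a guess) and shows the pipeline next to a pure-Python equivalent that also handles ties for first place.

```python
from collections import Counter

# Assumed field name: "hobbies" (an array of strings on each student document).
QUERY3_TOP5_PIPELINE = [
    {"$unwind": "$hobbies"},                                 # one document per hobby
    {"$group": {"_id": "$hobbies", "count": {"$sum": 1}}},   # count occurrences
    {"$sort": {"count": -1}},                                # most frequent first
    {"$limit": 5},                                           # keep the top 5
]


def hobby_counts(docs):
    """Count how many students list each hobby."""
    return Counter(h for d in docs for h in d.get("hobbies", []))


def most_popular(docs):
    """Return every hobby tied for the highest count, sorted by name."""
    counts = hobby_counts(docs)
    if not counts:
        return []
    best = max(counts.values())
    return sorted(h for h, n in counts.items() if n == best)


def top_n(docs, n=5):
    """Return up to n (hobby, count) pairs, most frequent first."""
    return hobby_counts(docs).most_common(n)
```

Changing the `$limit` stage to 1 yields a single most popular hobby, although it silently picks one hobby when several are tied.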
10. Part 1 - Queries and the Aggregation Pipeline [4]
Query 4: What is the GPA (ignoring dropped classes and in-progress classes) of the best student?
Query 5: Which student has the largest number of grades of 10?
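Queries 4 and 5 can be sketched as below. The status string "Complete" and the fields "courses" and "grade" are assumptions about the collection, and for Query 5 the sketch counts every course with a grade of 10 regardless of status. Both helper functions expect at least one qualifying student.

```python
# Assumed shape: each course has "course_status" and, when graded, "grade".
QUERY4_PIPELINE = [
    {"$unwind": "$courses"},
    {"$match": {"courses.course_status": "Complete"}},   # ignore dropped / in progress
    {"$group": {"_id": "$_id", "gpa": {"$avg": "$courses.grade"}}},
    {"$sort": {"gpa": -1}},
    {"$limit": 1},
]


def best_gpa(docs):
    """Return (student_id, gpa) for the highest average over completed courses."""
    gpas = {}
    for d in docs:
        grades = [c["grade"] for c in d.get("courses", [])
                  if c.get("course_status") == "Complete"]
        if grades:
            gpas[d["_id"]] = sum(grades) / len(grades)
    return max(gpas.items(), key=lambda kv: kv[1])


def most_tens(docs):
    """Return (student_id, count) for the student with the most grades of 10."""
    tens = {d["_id"]: sum(1 for c in d.get("courses", []) if c.get("grade") == 10)
            for d in docs}
    return max(tens.items(), key=lambda kv: kv[1])
```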
11. Part 1 - Queries and the Aggregation Pipeline [5]
Query 6: Which class has the highest average GPA?
Query 7: Which class has been dropped the greatest number of times?
12. Part 1 - Queries and the Aggregation Pipeline [6]
Query 8: Produce a count of classes that have been COMPLETED by class type. The class type is found by taking the first letter of the course code so that M102 has type M.
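The class type in Query 8 can be derived with a substring expression on the course code. The sketch below uses `$substrCP` for that step; the field names "courses" and "course_code" and the status string "Complete" are assumptions, and the Python function shows the same grouping on plain dictionaries.

```python
from collections import Counter

QUERY8_PIPELINE = [
    {"$unwind": "$courses"},
    {"$match": {"courses.course_status": "Complete"}},
    {"$group": {
        # First character of the course code, e.g. "M102" -> "M".
        "_id": {"$substrCP": ["$courses.course_code", 0, 1]},
        "completed": {"$sum": 1},
    }},
]


def completed_by_type(docs):
    """Count completed classes per class type (first letter of the course code)."""
    return dict(Counter(
        c["course_code"][0]
        for d in docs
        for c in d.get("courses", [])
        if c.get("course_status") == "Complete"
    ))
```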
13. Part 1 - Queries and the Aggregation Pipeline [7]
Query 9: Produce a transformation of the documents so that the documents now have an additional boolean field called “hobbyist” that is true when the student has more than 3 hobbies and false otherwise.
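A possible shape for the Query 9 transformation is a single `$addFields` stage that compares the size of the hobbies array with 3. The field name "hobbies" is an assumption, and `$ifNull` is added here so that a student without that field is treated as having no hobbies.

```python
QUERY9_PIPELINE = [
    {"$addFields": {
        "hobbyist": {
            # True only when the array has MORE than 3 elements.
            "$gt": [{"$size": {"$ifNull": ["$hobbies", []]}}, 3],
        },
    }},
]


def add_hobbyist(doc):
    """Return a copy of doc with the boolean "hobbyist" field added."""
    return {**doc, "hobbyist": len(doc.get("hobbies", [])) > 3}
```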
14. Part 1 - Queries and the Aggregation Pipeline [8]
Query 10: Produce a transformation of the documents so that the documents now have an additional field that contains the number of classes that the student has completed.
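Query 10 can be sketched by filtering the course array and taking its size. The output field name "completed_classes", the array name "courses" and the status string "Complete" are all placeholders chosen for this illustration.

```python
QUERY10_PIPELINE = [
    {"$addFields": {
        "completed_classes": {
            "$size": {
                "$filter": {
                    "input": {"$ifNull": ["$courses", []]},
                    "as": "course",
                    "cond": {"$eq": ["$$course.course_status", "Complete"]},
                },
            },
        },
    }},
]


def add_completed_count(doc):
    """Return a copy of doc with the number of completed classes added."""
    completed = sum(1 for c in doc.get("courses", [])
                    if c.get("course_status") == "Complete")
    return {**doc, "completed_classes": completed}
```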
15. Part 1 - Queries and the Aggregation Pipeline [9]
Query 11: Produce a transformation of the documents in the collection so that they look like the following output. The GPA is the average grade of all the completed classes. The other two computed fields are the number of classes currently in progress and the number of classes dropped. No other fields should be present.
16. Part 1 - Queries and the Aggregation Pipeline [10]
Query 12: Produce a NEW collection (HINT: Use $out in the aggregation pipeline) so that the new documents in this collection correspond to the classes on offer. The structure of the documents should be like the following output.
The _id field should be the course code. The course_title is what it was before. The numberOfDropouts is the number of students who dropped out. The numberOfTimesCompleted is the number of students that completed this class. The currentlyRegistered array is an array of ObjectIDs corresponding to the students who are currently taking the class. Finally, for the students that completed the class, the maxGrade, minGrade and avgGrade are the summary statistics for that class.
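The target document shape for Query 12 can be illustrated in plain Python before it is written as a pipeline. The sketch below is not the project's pipeline: it assumes a "courses" array with "course_code", "course_title", "course_status" and "grade", the status strings "Dropped", "In Progress" and "Complete", and a hypothetical output collection called "classes" for the final `$out` stage.

```python
# Final stage of an aggregation pipeline that writes results to a new collection.
OUT_STAGE = {"$out": "classes"}  # "classes" is a placeholder collection name


def classes_on_offer(docs):
    """Build one summary document per course code from student documents."""
    classes = {}
    for d in docs:
        for c in d.get("courses", []):
            entry = classes.setdefault(c["course_code"], {
                "_id": c["course_code"],
                "course_title": c.get("course_title"),
                "numberOfDropouts": 0,
                "numberOfTimesCompleted": 0,
                "currentlyRegistered": [],
                "_grades": [],  # temporary, removed before returning
            })
            status = c.get("course_status")
            if status == "Dropped":
                entry["numberOfDropouts"] += 1
            elif status == "In Progress":
                entry["currentlyRegistered"].append(d["_id"])
            elif status == "Complete":
                entry["numberOfTimesCompleted"] += 1
                entry["_grades"].append(c["grade"])
    for entry in classes.values():
        grades = entry.pop("_grades")
        entry["maxGrade"] = max(grades) if grades else None
        entry["minGrade"] = min(grades) if grades else None
        entry["avgGrade"] = sum(grades) / len(grades) if grades else None
    return list(classes.values())
```

In an aggregation pipeline the same grouping would usually follow an `$unwind` of the course array and a `$group` on the course code, with `OUT_STAGE` as the last stage.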
17. Part 1 - Queries and the Aggregation Pipeline [11]
18. Part 2 - Python & MongoDB [1]
python_mongodb.py: Implement simple operations on mongo database.
Connect to mongo database and collection.
Connect to mongo database and collection and insert a record.
19. Part 2 - Python & MongoDB [2]
Connect to mongo database and collection and insert multiple records.
Connect to mongo database and collection and print its content.
20. Part 2 - Python & MongoDB [3]
Connect to mongo database and collection and update its documents.
Connect to mongo database and collection and print a specific field.
21. Part 2 - Python & MongoDB [4]
Connect to mongo database and collection and convert the collection to a dataframe.
Connect to mongo database and collection and import data from a dataframe.
22. Part 2 - Python & MongoDB [5]
1. Clone this repository:
2. Install the required Python packages.
3. Run python_mongodb.py to implement basic operations (insert_one, insert_many, update, delete_one, delete_many, etc.) on MongoDB.
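The operations listed in Part 2 might look roughly like the helpers below. This is an illustrative sketch and not the contents of python_mongodb.py: the function names, the connection URI and the database and collection names are placeholders. Only `connect` needs pymongo and a running server; the other helpers work with any object exposing the same collection methods.

```python
def connect(uri="mongodb://localhost:27017", db_name="school", coll_name="students"):
    """Return a collection handle (URI and names are placeholder values)."""
    from pymongo import MongoClient  # imported lazily; requires the pymongo package
    return MongoClient(uri)[db_name][coll_name]


def insert_record(collection, doc):
    """Insert one document and return its generated _id."""
    return collection.insert_one(doc).inserted_id


def insert_records(collection, docs):
    """Insert several documents and return their generated _ids."""
    return collection.insert_many(docs).inserted_ids


def set_field(collection, query, field, value):
    """Set one field on every matching document; return how many were modified."""
    return collection.update_many(query, {"$set": {field: value}}).modified_count


def field_values(collection, field):
    """Return the values of a single field across the whole collection."""
    projection = {field: 1, "_id": 0}  # fetch only the requested field
    return [d.get(field) for d in collection.find({}, projection)]
```

For the dataframe steps, a common approach is `pandas.DataFrame(list(collection.find()))` in one direction and `collection.insert_many(df.to_dict("records"))` in the other, assuming pandas is installed.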
24. Part 3 - MapReduce (Word Count) [1]
MapReduce 1: Write a map reduce job on the students collection similar to the classic word count example. More specifically, implement a word count using the course title field as the text. In addition, exclude stop words from this list. You should find/write your own list of stop words. (Stop words are the common words in the English language like “a”, “in”, “to”, “the”, etc.)
References: [4,5]
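The logic of the word count job can be emulated in plain Python to make the map and reduce roles explicit. The sketch below assumes course titles live under "courses" and "course_title" in each student document, and uses a short illustrative stop-word list rather than the one used in the project.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; a real job would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "of", "to", "the", "for", "with"}


def map_titles(doc):
    """Map step: emit (word, 1) for every non-stop word in each course title."""
    for course in doc.get("courses", []):
        for word in re.findall(r"[a-z0-9]+", course.get("course_title", "").lower()):
            if word not in STOP_WORDS:
                yield word, 1


def reduce_counts(key, values):
    """Reduce step: sum the partial counts emitted for one word."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Tiny in-memory driver: group mapped values by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}
```

In MongoDB the map and reduce functions are written in JavaScript, but they follow the same emit-then-sum structure.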
27. Part 3 - MapReduce (Average grade) [1]
MapReduce 2: Write a map reduce job on the students collection whose goal is to compute average GPA scores for completed courses by home city and by course type (M, B, P, etc.).
References: [4,5]
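For the average-grade job, one common design is to emit (sum, count) pairs rather than raw averages, so the reduce step stays correct even if it is applied to partial results, and to divide only in a final step. The sketch below emulates that in Python; the field names "home_city", "courses", "course_code" and "grade" and the status string "Complete" are assumptions.

```python
from collections import defaultdict


def map_grades(doc):
    """Map step: emit ((home_city, course_type), (grade, 1)) per completed course."""
    for course in doc.get("courses", []):
        if course.get("course_status") == "Complete":
            key = (doc["home_city"], course["course_code"][0])
            yield key, (course["grade"], 1)


def reduce_grades(key, values):
    """Reduce step: add up grade sums and counts; safe to re-apply to its own output."""
    return (sum(v[0] for v in values), sum(v[1] for v in values))


def finalize_average(key, value):
    """Finalize step: turn a (sum, count) pair into an average."""
    return value[0] / value[1]


def map_reduce(docs, mapper, reducer, finalizer):
    """Tiny in-memory driver: group by key, reduce each group, then finalize."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: finalizer(key, reducer(key, values)) for key, values in groups.items()}
```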