MAS.500 - Software Module - Rahul Bhargava 
Data Management
2014.11.21
Topics 
❖ Regular Expressions (online quickstart) 
❖ Databases 
❖ History 
❖ Relational modeling 
❖ SQL (MySQL quickstart)
❖ Keys/Indexes
❖ NoSQL (CouchDB quickstart)
❖ Behind the Scenes with Ed Platt 
❖ Homework
Regular Expressions
Regular Expressions (RegEx/grep)
❖ Match a string of text by defining a pattern 
❖ Useful for cleaning up or identifying data 
❖ “Find” Demo on http://regexpal.com 
❖ “Find/Replace” Demo with http://www.sugarscript.com/findandreplace/index.php
❖ Interested? Interactive tutorial on http://regexone.com
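The "Find" and "Find/Replace" demos above can also be tried from a script. The following is a minimal sketch using Python's built-in `re` module; the input string and both patterns are made up for illustration, and the email pattern is deliberately simple rather than a full validator.

```python
import re

# Hypothetical messy input; the names, addresses, and number are made up.
text = "Contact: alice@example.com, BOB@Example.org; phone 617-555-0100"

# "Find": collect everything that looks roughly like an email address.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
emails = email_pattern.findall(text)

# "Find/Replace": clean up the phone number by dropping the dashes.
# The parentheses capture groups that the replacement refers to as \1, \2, \3.
phone_pattern = re.compile(r"(\d{3})-(\d{3})-(\d{4})")
cleaned = phone_pattern.sub(r"\1\2\3", text)

print(emails)
print(cleaned)
```

The same patterns can be pasted into the online demo tools to see which parts of the text they match.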
Databases
Database History 
❖ List-based 
❖ Follow link from one record to another (linked-list) 
❖ File-system data stores 
❖ Based on file-naming conventions, limited by file I/O speeds
❖ Generic data storage and management 
❖ Relational modeling, or entity-relationship (ER) modeling
Relational Modeling: In English
❖ A Group has many People 
❖ A Person belongs to one Group 
❖ A Group has many Projects 
❖ A Project belongs to one Group 
❖ A Person has many Projects 
❖ A Project has many People
Relational Modeling: Diagram
[Diagram: Person (many) to Group (1); Project (many) to Group (1); Person (many) to Project (many)]
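Relational Modeling: Tables
Group: id, name, url
Person: id, name, password, group_id
Project: id, name, url
Membership: person_id, project_id
[Diagram labels: Person (many) to Group (1); Project (many) to Group (1); Person (many) to Project (many), via Membership]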
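Relational Modeling: Keys
Group: id (key), name, url
Person: id (key), name, password, group_id (foreign key)
Project: id (key), name, url
Membership: person_id (foreign key), project_id (foreign key)
[Diagram labels: Person (many) to Group (1); Project (many) to Group (1); Person (many) to Project (many), via Membership]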
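One way to try out the tables and keys above is with Python's built-in `sqlite3` module. This is only a sketch under a few assumptions: the column types are guesses, the table is named `groups` because GROUP is an SQL keyword, the sample rows are made up, and the composite primary key on `membership` is an illustrative choice rather than something the slides specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE groups (
    id   INTEGER PRIMARY KEY,              -- key
    name TEXT NOT NULL,
    url  TEXT
);
CREATE TABLE person (
    id       INTEGER PRIMARY KEY,          -- key
    name     TEXT NOT NULL,
    password TEXT NOT NULL,
    group_id INTEGER NOT NULL REFERENCES groups(id)   -- foreign key
);
CREATE TABLE project (
    id   INTEGER PRIMARY KEY,              -- key
    name TEXT NOT NULL,
    url  TEXT
);
CREATE TABLE membership (
    person_id  INTEGER NOT NULL REFERENCES person(id),   -- foreign key
    project_id INTEGER NOT NULL REFERENCES project(id),  -- foreign key
    PRIMARY KEY (person_id, project_id)    -- assumption: one row per pairing
);
""")

# Made-up sample rows.
conn.execute("INSERT INTO groups (id, name) VALUES (1, 'Example Group')")
conn.execute(
    "INSERT INTO person (id, name, password, group_id) "
    "VALUES (1, 'Ada', 'placeholder', 1)"
)

# A person pointing at a group that does not exist should be rejected.
try:
    conn.execute(
        "INSERT INTO person (id, name, password, group_id) "
        "VALUES (2, 'Bob', 'placeholder', 99)"
    )
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

person_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
print(fk_rejected, person_count)
```

The foreign keys are what let the database refuse rows that point at nothing, which is the practical payoff of declaring them.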
Structured Query Language (SQL)
❖ Works in lots of database servers 
❖ SQLite, MySQL, PostgreSQL, MS SQL Server 
❖ Standard way to: 
❖ Find subsets of data based on criteria 
❖ Merge data in separate tables 
❖ Compute aggregate info 
❖ Assumptions 
❖ Don’t duplicate data (“data normalization”) 
❖ Various parts of your data relate to each other 
❖ Your metadata/schema (tables/columns) doesn’t change often 
❖ Many frameworks will generate SQL for you 
❖ Ask about Database Abstraction Layers
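The three standard uses listed above (subsets, merging tables, aggregates) can be sketched with Python's built-in `sqlite3` module. The table layout is a trimmed-down version of the Group/Person model, and all names and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE groups (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    name TEXT,
    group_id INTEGER REFERENCES groups(id)
);
INSERT INTO groups VALUES (1, 'Group A'), (2, 'Group B');
INSERT INTO person VALUES (1, 'Ada', 1), (2, 'Grace', 1), (3, 'Linus', 2);
""")

# Find a subset of data based on criteria (WHERE).
subset = conn.execute(
    "SELECT name FROM person WHERE group_id = ? ORDER BY name", (1,)
).fetchall()

# Merge data in separate tables (JOIN on the foreign key).
merged = conn.execute(
    "SELECT person.name, groups.name "
    "FROM person JOIN groups ON person.group_id = groups.id "
    "ORDER BY person.id"
).fetchall()

# Compute aggregate info (COUNT with GROUP BY).
counts = conn.execute(
    "SELECT groups.name, COUNT(person.id) "
    "FROM groups LEFT JOIN person ON person.group_id = groups.id "
    "GROUP BY groups.id ORDER BY groups.name"
).fetchall()

print(subset)
print(merged)
print(counts)
```

The `?` placeholder keeps the value separate from the SQL text, which is also what most database abstraction layers do for you behind the scenes.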
NoSQL 
❖ Sometimes your data isn’t relational and the metadata changes often
❖ Queuing, document storage, logging, real-time, low-latency, concurrency
❖ Read this write-up for more:
❖ http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Tangent: JavaScript Object Notation (JSON)
❖ A human-readable data exchange format 
❖ CSV, XML, YAML are some others 
❖ Example: 
❖ http://media.mongodb.org/zips.json 
❖ http://mongohub.todayclose.com (for Mac)
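To see what "human-readable data exchange" means in practice, here is a small sketch using Python's built-in `json` module. The record is made up; its field names are only loosely modeled on a zip-code style dataset and are an assumption, not a copy of the linked file.

```python
import json

# A made-up record; every value here is a placeholder.
record = {
    "id": "00000",
    "city": "EXAMPLEVILLE",
    "state": "XX",
    "pop": 1234,
    "loc": [-71.0, 42.0],
}

# Serialize to JSON text that a person can read and another program can parse.
text = json.dumps(record, indent=2)
print(text)

# Parse the text back into Python data structures.
parsed = json.loads(text)
print(parsed["city"], parsed["loc"][1])
```

Objects map to dictionaries and arrays map to lists, so the round trip gives back the same structure.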
❖ sudo mkdir -p /data/db
MongoDB: Intro 
❖ Demo: 
❖ Command Line 
❖ MongoHub
Indexes 
❖ An index tracks keys 
❖ Convention: have an “id” column with an index on it 
❖ Why all these indexes? 
❖ Multiple ways to get at rows quickly 
❖ Creating indexes is tricky 
❖ Many frameworks include query logging to help you find slow queries that might need optimizing
❖ Query optimization is a bit of an art 
❖ Use the “EXPLAIN” command to see how the database plans to run a query
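The effect of an index can be observed with SQLite's `EXPLAIN QUERY PLAN`, its variant of the "Explain" command. This is a sketch using Python's built-in `sqlite3` module; the table, the generated rows, and the index name `idx_person_group_id` are all made up, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, group_id INTEGER)"
)
# Generate some made-up rows spread across ten groups.
conn.executemany(
    "INSERT INTO person (name, group_id) VALUES (?, ?)",
    [(f"person{i}", i % 10) for i in range(1000)],
)


def query_plan(sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable detail text.
    return " | ".join(row[-1] for row in rows)


sql = "SELECT id, name FROM person WHERE group_id = ?"

# Without an index on group_id, SQLite has to scan the whole table.
plan_before = query_plan(sql, (3,))

# Add an index on the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_person_group_id ON person (group_id)")
plan_after = query_plan(sql, (3,))

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two plans is a quick way to check whether a new index is actually being used before relying on it.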
Map-Reduce Instead of SQL 
❖ Used to query large datasets 
❖ Example: Count words in a document 
❖ Map: select the data you need to operate on 
❖ “emit” one record for each word in a document, keyed by the word
❖ Reduce: combine the mapped data 
❖ Sum up the uses of each word, “emitting” one record for each total
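The word-count example above can be sketched in plain Python. This is a single-process illustration of the idea, not how any particular database runs it: the function names are made up, and the grouping step in the middle is something a real map-reduce system would normally do for you.

```python
from collections import defaultdict


def map_words(document):
    """Map: emit one (word, 1) record per word, keyed by the word."""
    for word in document.lower().split():
        yield (word, 1)


def group_by_key(records):
    """Collect emitted values under their key (normally done by the system)."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce: sum the uses of one word and emit a single total record."""
    return (word, sum(counts))


# A made-up document.
document = "the quick fox jumps over the lazy dog the end"

grouped = group_by_key(map_words(document))
totals = dict(reduce_counts(word, counts) for word, counts in grouped.items())

print(totals["the"], totals["fox"])
```

Because each map call looks at one piece of input and each reduce call looks at one key, the work can be split across many machines, which is why the approach suits large datasets.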
Picking Data Storage Strategies
❖ If you just need to dump data and pull it out by some id, use a NoSQL solution (MongoDB is simple)
❖ flexible, easy to start with
❖ If you are modeling an app, a relational database is usually the right answer (MySQL/PostgreSQL are standard)
❖ Database modeling is REALLY important to get right at the start of your project, because it is a pain to change later
❖ Names matter – choose your table names carefully 
❖ PS: we can try stuff out on Amazon’s cloud services for free
Homework 
❖ See course outline
