[Mas 500] Data Basics


Published in: Education, Technology

  1. MAS.500 - Software Module - Rahul Bhargava Data Management 2014.11.21
  2. Topics
     ❖ Regular Expressions (online quickstart)
     ❖ Databases
        ❖ History
        ❖ Relational modeling
        ❖ SQL (MySQL quickstart)
        ❖ Keys/Indexes
        ❖ NoSQL (CouchDB quickstart)
     ❖ Behind the Scenes with Ed Platt
     ❖ Homework
  3. Regular Expressions
  4. Regular Expressions (RegEx/grep)
     ❖ Match a string of text by defining a pattern
     ❖ Useful for cleaning up or identifying data
     ❖ “Find” Demo on http://regexpal.com
     ❖ “Find/Replace” Demo with http://www.sugarscript.com/findandreplace/index.php
     ❖ Interested? Interactive tutorial on http://regexone.com
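The "Find" and "Find/Replace" ideas from the slide above can be sketched with Python's built-in `re` module. This is only an illustration: the sample text and the phone-number pattern are invented here and are not taken from the demos.

```python
import re

# Hypothetical messy data, invented for this example.
text = "Call 617-555-0134 or 617.555.0199 before 2014-11-21."

# Pattern: three digits, a separator, three digits, a separator, four digits.
# The parentheses capture each group of digits for reuse in the replacement.
phone_pattern = re.compile(r"\b(\d{3})[-.](\d{3})[-.](\d{4})\b")

# "Find": list every substring that matches the pattern.
phones = [match.group(0) for match in phone_pattern.finditer(text)]

# "Find/Replace": rewrite every match into one consistent format.
cleaned = phone_pattern.sub(r"(\1) \2-\3", text)

print(phones)
print(cleaned)
```

The date at the end of the sample text is left alone because it does not fit the three-three-four digit shape, which is the kind of selectivity that makes patterns useful for cleaning data.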
  5. Databases
  6. Database History
     ❖ List-based
        ❖ Follow link from one record to another (linked-list)
     ❖ File-system data stores
        ❖ Based on filenaming convention, limited by file i/o speeds
     ❖ Generic data storage and management
        ❖ Relational modeling or entities and relationships (ER)
  7. Relational Modeling: In English
     ❖ A Group has many People
     ❖ A Person belongs to one Group
     ❖ A Group has many Projects
     ❖ A Project belongs to one Group
     ❖ A Person has many Projects
     ❖ A Project has many People
  8. Relational Modeling: Diagram
     [Diagram: Person (many) to Group (1); Project (many) to Group (1); Person (many) to Project (many)]
  9. Relational Modeling: Tables
     ❖ Group: id, name, url
     ❖ Person: id, name, password, group_id
     ❖ Project: id, name, url
     ❖ Membership: person_id, project_id
     [Diagram: same relationships as slide 8, drawn between the tables]
  10. Relational Modeling: Keys
     ❖ Group: id, name, url
     ❖ Person: id, name, password, group_id
     ❖ Project: id, name, url
     ❖ Membership: person_id, project_id
     [Diagram labels: each id column is marked "key"; group_id, person_id, and project_id are marked "Foreign keys"]
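As a rough sketch, the four tables and their keys could be declared like this with Python's built-in `sqlite3` module. The column names come from the slides; the column types, the in-memory database, and the sample rows are assumptions made for illustration. Note that `group` is a reserved word in SQL, so it has to be quoted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
CREATE TABLE "group" (
    id   INTEGER PRIMARY KEY,   -- key
    name TEXT NOT NULL,
    url  TEXT
);
CREATE TABLE person (
    id       INTEGER PRIMARY KEY,               -- key
    name     TEXT NOT NULL,
    password TEXT,
    group_id INTEGER REFERENCES "group"(id)     -- foreign key: a Person belongs to one Group
);
CREATE TABLE project (
    id   INTEGER PRIMARY KEY,   -- key
    name TEXT NOT NULL,
    url  TEXT
);
CREATE TABLE membership (
    person_id  INTEGER REFERENCES person(id),   -- foreign key
    project_id INTEGER REFERENCES project(id),  -- foreign key
    PRIMARY KEY (person_id, project_id)         -- one row per Person/Project pair
);
""")

# Hypothetical sample rows.
conn.execute('INSERT INTO "group" (id, name) VALUES (1, ?)', ("Example Group",))
conn.execute("INSERT INTO person (id, name, group_id) VALUES (1, 'Ada', 1)")

# A person pointing at a group that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO person (id, name, group_id) VALUES (2, 'Bob', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

person_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
print(rejected, person_count)
```

The `membership` table is what turns the many-to-many relationship between people and projects into two one-to-many relationships.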
  11. Structured Query Language (SQL)
     ❖ Works in lots of database servers
        ❖ SQLite, MySQL, PostgreSQL, MS SQL Server
     ❖ Standard way to:
        ❖ Find subsets of data based on criteria
        ❖ Merge data in separate tables
        ❖ Compute aggregate info
     ❖ Assumptions
        ❖ Don’t duplicate data (“data normalization”)
        ❖ Various parts of your data relate to each other
        ❖ Your metadata/schema (tables/columns) doesn’t change often
     ❖ Many frameworks will generate SQL for you
        ❖ Ask about Database Abstraction Layers
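The three standard uses listed above (subsets, merging, aggregates) map onto `WHERE`, `JOIN`, and `GROUP BY`. The following is a minimal sketch using Python's built-in `sqlite3` module; the table contents are invented for illustration, and only the table and column names follow the earlier slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.executescript("""
CREATE TABLE "group" (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    name TEXT,
    group_id INTEGER REFERENCES "group"(id)
);
-- Hypothetical sample data.
INSERT INTO "group" VALUES (1, 'Group A'), (2, 'Group B');
INSERT INTO person VALUES (1, 'Ana', 1), (2, 'Ben', 1), (3, 'Cy', 2);
""")

# Find a subset of data based on criteria.
subset = conn.execute(
    "SELECT name FROM person WHERE group_id = ? ORDER BY name", (1,)
).fetchall()

# Merge data held in separate tables.
joined = conn.execute(
    'SELECT person.name, "group".name FROM person '
    'JOIN "group" ON person.group_id = "group".id ORDER BY person.id'
).fetchall()

# Compute aggregate info: number of people per group.
counts = conn.execute(
    'SELECT "group".name, COUNT(person.id) FROM "group" '
    'LEFT JOIN person ON person.group_id = "group".id '
    'GROUP BY "group".id ORDER BY "group".id'
).fetchall()

print(subset)
print(joined)
print(counts)
```

Because each group's name is stored once and people refer to it by `group_id`, renaming a group means changing a single row, which is the practical payoff of not duplicating data.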
  12. NoSQL
     ❖ Sometimes your data isn’t relational and the metadata changes often
     ❖ Queuing, document storage, logging, real-time, low-latency, concurrency
     ❖ Read this write up for more:
        ❖ http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
  13. Tangent: JavaScript Object Notation (JSON)
     ❖ A human-readable data exchange format
     ❖ CSV, XML, YAML are some others
     ❖ Example:
        ❖ http://media.mongodb.org/zips.json
        ❖ http://mongohub.todayclose.com (for Mac)
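To make the format concrete, here is a small sketch of writing and reading JSON with Python's built-in `json` module. The record below is invented for illustration; its field names and values are assumptions and are not copied from the linked example file.

```python
import json

# Hypothetical record, invented for this example.
record = {
    "city": "EXAMPLEVILLE",
    "state": "MA",
    "pop": 1234,
    "loc": [-71.0, 42.0],
}

# Serialize the Python dictionary to a JSON string.
text = json.dumps(record, sort_keys=True)

# Parse the JSON string back into Python objects.
parsed = json.loads(text)

print(text)
print(parsed["loc"][1])
```

Nested lists and objects survive the round trip, which is part of why document stores exchange records in this shape rather than as flat CSV rows.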
  14. ❖ sudo mkdir -p /data/db
  15. MongoDB: Intro
     ❖ Demo:
        ❖ Command Line
        ❖ MongoHub
  16. Indexes
     ❖ An index tracks keys
     ❖ Convention: have an “id” column with an index on it
     ❖ Why all these indexes?
        ❖ Multiple ways to get at rows quickly
     ❖ Creating indexes is tricky
     ❖ Many frameworks include query logging to help you find slow queries that might need optimizing
     ❖ Query optimization is a bit of an art
     ❖ Use the “Explain” command
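One way to see what an index changes is to compare the query plan before and after creating it. The sketch below uses SQLite through Python's built-in `sqlite3` module, where the command is spelled `EXPLAIN QUERY PLAN`; other servers such as MySQL and PostgreSQL use `EXPLAIN` with different output. The table contents, the index name, and the `plan` helper are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, group_id INTEGER)")

# Hypothetical data: 1000 people spread across 10 groups.
conn.executemany(
    "INSERT INTO person (name, group_id) VALUES (?, ?)",
    [(f"person{i}", i % 10) for i in range(1000)],
)

query = "SELECT name FROM person WHERE group_id = 3"


def plan(sql):
    """Return SQLite's description of how it would run the query.

    The human-readable detail is the last column of each plan row.
    """
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


before = plan(query)  # no index on group_id yet, so every row is scanned
conn.execute("CREATE INDEX idx_person_group_id ON person (group_id)")
after = plan(query)   # the plan should now mention the new index

print("before:", before)
print("after: ", after)
```

The exact wording of the plan varies between SQLite versions, so treat the printed text as a hint about which queries are scanning whole tables rather than as stable output.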
  17. Map-Reduce Instead of SQL
     ❖ Used to query large datasets
     ❖ Example: Count words in a document
     ❖ Map: select the data you need to operate on
        ❖ “emit” one record for each word in a document, keyed by the word
     ❖ Reduce: combine the mapped data
        ❖ Sum up the uses of each word, “emitting” one record for each total
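The word-count example can be sketched in plain Python to show the shape of the two phases. This runs in a single process and is not how a real map-reduce system distributes work; the function names and sample documents are invented for illustration.

```python
from collections import defaultdict


def map_words(document):
    """Map: emit one (word, 1) record for each word in the document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: sum the records for one word and emit a single total."""
    return word, sum(counts)


def map_reduce(documents):
    # Group the mapped records by key, as a map-reduce framework would
    # do between the two phases.
    grouped = defaultdict(list)
    for document in documents:
        for key, value in map_words(document):
            grouped[key].append(value)
    # Run the reduce step once per key.
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical input documents.
docs = ["the cat sat", "the dog sat on the cat"]
totals = map_reduce(docs)
print(totals)
```

Because the map step looks at one document at a time and the reduce step looks at one key at a time, both phases can be split across many machines, which is what makes the pattern suitable for large datasets.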
  18. Picking Data Storage Strategies
     ❖ If you just need to dump data and pull it out by some id, use a NoSQL solution (MongoDB is simple)
        ❖ flexible, easy to start with
     ❖ If you are modeling an app, a relational database is usually the right answer (MySQL/PostgreSQL are standard)
     ❖ Database modeling is REALLY important to get right at the start of your project, because it is a pain to change later
     ❖ Names matter – choose your table names carefully
     ❖ PS: we can try stuff out on Amazon’s cloud services for free
  19. Homework
     ❖ see course outline