Loose-Schema Databases and Heterogenous Data
Upcoming SlideShare
Loading in...5

Loose-Schema Databases and Heterogenous Data



Traditional databases are designed around homogenous data. Heterogenous data can be shoehorned into a homogenous format, but constraints are hard to enforce. A new generation of databases is ...

Traditional databases are designed around homogenous data. Heterogenous data can be shoehorned into a homogenous format, but constraints are hard to enforce. A new generation of databases is emerging, which embraces data in varying formats from varying sources. Access may be with familiar relational operations, or radically different operations such as Map-Reduce. In this talk Doug will introduce us on what is happing in post-sql database space.

BIO: Doug Reeder has taught robotics and programmed for researchers at OSU. Currently he writes task-list management software in JavaScript for HP webOS smartphones and tablets.



Total Views
Views on SlideShare
Embed Views



1 Embed 1

http://www.techgig.com 1



Upload Details

Uploaded via as OpenOffice

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment
  • range of BP values not same as range of blood glucose, systolic must be greater than diastolic
  • Designed to store and report on large amounts of “semi-structured” data “shared-nothing" clustering written in Erlang for robust concurrent systems; no shared state threading and all data is immutable no fixed schema, no declarative schema, update function can check consistency and permissions incremental replication, partial replicas possible; also runs on handhelds
  • Simple map func. Can extract subset of documents map+reduce can collect related documents (e.g. blog post & comments)
  • Your company uses a traditional Project Planning system, which works in terms of man-hours Blue system has project with two subprojects Your group uses an online task-tracking system like Basecamp Yellow system has three tasks, one of which has a subtask There is also software associated with the CouchDB database, which allows you to specify that yellow tasks are components of blue subprojects From this, we assign a path to each item.
  • Goal: Calculate # completed tasks for each item [left] documents in CouchDB (blanks are not null, but fields not present) Map function written in JavaScript, emits 0 or more values, to ANY bin (key) [go through map function for each record]

Loose-Schema Databases and Heterogenous Data Presentation Transcript

  • 1. Loose-Schema Databases and Heterogenous Data
    • Doug Reeder
    • 2. Hominid Software (µ-ISV)
    • 3. Task-list management for HP webOS
  • 4. Heterogenous Data
    • Different communities conceptualize similar things differently
      • Music recording: artist vs. composer+performer+conductor
      • 5. Users expect computers to adapt to them
    • Single-source data can be heterogenous
      • Medical tests: any given patient has only had a few of many tests
  • 6. Heterogenous Data, Homogenous Schema
    • converting to homogenous format -> loss or alteration of data.
      • Contact: title+given+middle+family+suffix -> given+family
      • 7. Acceptable for 1-way transfer
    • round-trip data to external sources
      • alteration generates a spurious change notification
      • 8. losing data is unacceptable
  • 9. Heterogenous Data, Manual Union Schema
    • Union of all fields in any source schema
    • 10. schema must be updated before data with new source schema can be stored
    • 11. storage may be inefficient
    • 12. works fine for music meta-data, works poorly for medical tests
  • 13. Loose-Schema DB
    • Need not specify fields & types in advance
    • 14. Array & object fields allowed
      • If value is atomic for the purposes of the database, it doesn't violate 1NF
    • Still normalize and selectively denormalize
    • 15. May still enforce field constraints
    • 16. Can't enforce foreign-key relations (no JOINs)
  • 17. CouchDB
    • stores JSON "documents"
    • 18. RESTful HTTP API + client libs
    • 19. crash-only file is always consistent
    • 20. lockless & optimistic
      • if revision number doesn't match, update fails.
      • 21. Resolution similar to Subversion: client merge, re-submit
  • 22. CouchDB Queries
    • only Map-Reduce on pre-defined views
    • 23. 1st query against a view is slow; later only require incremental updating
    • 24. map func (JS or Erlang) maps document to N-dimensional array of buckets
    • 25. reduce func (optional) computes sums, averages, unions, etc over one or more dimensions
  • 26. Data From Two Task Systems
  • 27. Map-Reduce Completed Tasks Map: for each ancestor path, write [# completed, 1] Reduce: sum 1 st and 2 nd member of pair
  • 28. Conclusion
    • Heterogenous data may be more naturally processed using a non-relational model
    • 29. couchdb.apache.org
    • 30. Questions?