Factual presentation for PG West 2010

  • "built" on living data?
  • Look for data
    Create a new table
    Sort + search it
  • Alternate: roll our own solution
    Persistence: Application/server restarts
  • Other attributes: number of inputs, level of consensus, etc.
  • Add new column
    Add new row
    Give inputs
    Do a merge

    1. Factual
        Eric Lui
        Software Engineer, Data Storage
        eric@factual.com
    2. What is Factual.com?
        Factual is a platform for sharing, mashing, and publishing open data.
    3. Crowd-Sourced Data … is terrific!
        • Verifiable
        • Vote-driven
        • Customizable
    4. Demo
    5. Data Storage
        Goal:
        • 10M tables
        • 1B rows (summarized)
        • 10B inputs (or "votes")
        Raw storage
        • 1TB per input server
        • 100MB+ per dataset
    6. What does all this "scale" mean?
        Map-Reduce is the right architecture for us:
        • High volume storage
        • Scales (with the right design)
        • Shards and partitions in-place
        • Minimal downtime
        • Throwaway intermediary stages
    7. What does all this "scale" mean?
        • Hard to profile
        • Hard to predict what table will get "hot"
        • Performance tuning has to be general, unless we're on a Service Level Agreement and can devote DBA resources (not our core strength)
        • Map-Reduce is not real time
    8. Data Storage Challenges
        • Summarization operations are memory-intensive
        • N-way merging is expensive (i.e., slow)
        • Streaming is necessary to serve back full summaries
        • Common use case is just the first N rows
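
The last three points on slide 8 fit together: if the shards are already sorted, an N-way merge can be streamed, and a request for only the first N rows never has to materialize the full summary. A minimal sketch of that idea follows. The row shape and the function names `merge_sorted_shards` and `first_n_rows` are assumptions made for illustration, not Factual's implementation.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical row shape: (sort key, payload).
Row = Tuple[int, str]


def merge_sorted_shards(shards: Iterable[Iterable[Row]]) -> Iterator[Row]:
    """Lazily merge shards that are each already sorted by key.

    heapq.merge keeps only one pending row per shard in memory,
    so the merged stream is produced without loading every shard.
    """
    return heapq.merge(*shards)


def first_n_rows(shards: Iterable[Iterable[Row]], n: int) -> List[Row]:
    """Return the first n rows of the merged stream and stop reading."""
    return list(islice(merge_sorted_shards(shards), n))
```

Because the merge is a generator, the same `merge_sorted_shards` stream can also serve a full summary by iterating it to the end.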
    9. Emerging Patterns
        • Many reads
        • (Relatively) few new rows
        • (Very) few row updates
        • Infrequent (< 1 per day) table-wide re-summarizations
    10. High Availability
        Votestore
        • 3x Redundancy
    11. High Availability
        Problem: Summarization is slow.
    12. High Availability
        Problem: Summarization is slow.
        Solution: Build a caching layer.
    13. High Availability
        Problem: Summarization is slow.
        Solution: Build a caching layer.
        Cache
        • 3x Replication
        • "Dumb" load balancing
        • Server Affinity (via Zookeeper)
    14. Metaphor Shear
        Why PostgreSQL? Pros
        • End-user expectations map to RDBMS world
        • Indexing on common operations
            o (ORDER BY, WHERE)
        • Full-text search
        • Latitude/longitude/geo functions with PostGIS
        • Aggregation on summarized results
        • Built-in persistence
    15. Metaphor Shear
        Why PostgreSQL? Cons
        • No built-in "versioning"
        • Re-summarization, though infrequent, is expensive
        • Need to map lisp-based query language to SQL
    16. High Availability
        Why PostgreSQL? Other considerations
        • Must proactively store attributes
        • Schema changes are expensive
        • Handling "upsert" operations is awkward
        • Deletes are difficult (but infrequent)
        • (related) No concept of row merge
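
For context on the "upsert" point: PostgreSQL only gained `INSERT ... ON CONFLICT` in version 9.5, well after this talk, so an update-then-insert pattern was the usual workaround. The sketch below shows that pattern. It uses Python's built-in `sqlite3` module purely as a runnable stand-in for a PostgreSQL connection, and the `summary` table, its columns, and `upsert_row` are hypothetical names.

```python
import sqlite3


def upsert_row(conn: sqlite3.Connection, row_id: str, value: str) -> None:
    """Update the row if it exists, otherwise insert it.

    Note: without extra locking or a retry loop, two concurrent
    writers can both see rowcount == 0 and race on the INSERT,
    which is part of why this pattern is awkward.
    """
    cur = conn.execute(
        "UPDATE summary SET value = ? WHERE row_id = ?", (value, row_id)
    )
    if cur.rowcount == 0:
        # No existing row matched, so create it.
        conn.execute(
            "INSERT INTO summary (row_id, value) VALUES (?, ?)", (row_id, value)
        )
    conn.commit()
```

A caller would create the hypothetical table first, for example with `CREATE TABLE summary (row_id TEXT PRIMARY KEY, value TEXT)`, and then call `upsert_row(conn, "r1", "a")`.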
    17. Demo
    18. Cache Consistency
        ACID? Not really...
        High concurrency favored over database-style transactions
    19. Cache Consistency
        ACID? Not really...
        Eventually Consistent
    20. Consistency Challenges
        Cache Invalidation
        • How do I handle new inputs?
    21. Consistency Challenges
        Cache Invalidation
        • How do I handle new inputs?
            o Shield the Input Store
                - Low priority: shield the input store
                - Row-level invalidations
            o Lazy
                - Fetch updated rows on summary request
                - Leverage PostgreSQL to track invalidations
            o Decouple From Input API call
                - Async notification
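
The row-level, lazy invalidation described on slide 21 can be sketched roughly as follows. This is an in-memory illustration under stated assumptions: the class `LazyRowCache` and the `fetch_row` callback are hypothetical, and the slides indicate the real system tracked invalidations in PostgreSQL rather than in a Python set.

```python
from typing import Callable, Dict, Set


class LazyRowCache:
    """Cache of summarized rows with row-level, lazy invalidation."""

    def __init__(self, fetch_row: Callable[[str], dict]) -> None:
        # fetch_row stands in for the expensive call to the input store.
        self._fetch_row = fetch_row
        self._rows: Dict[str, dict] = {}
        self._stale: Set[str] = set()

    def invalidate(self, row_id: str) -> None:
        """Record that a row changed; cheap, never touches the input store.

        This is what an asynchronous notification from the input API
        would call when a new input arrives.
        """
        self._stale.add(row_id)

    def get(self, row_id: str) -> dict:
        """Serve a row, re-fetching it only if it is missing or stale."""
        if row_id in self._stale or row_id not in self._rows:
            self._rows[row_id] = self._fetch_row(row_id)
            self._stale.discard(row_id)
        return self._rows[row_id]
```

The point of the design is that a burst of new inputs costs only set insertions; the input store is hit at most once per stale row, and only when someone actually asks for that row.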
    22. Consistency Challenges
        Cache Instance Management
        • How do we handle query changes?
            o filtering out spam inputs
            o change the aggregation function
            o give more weight to table owner's votes
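
To make the three kinds of query change on slide 22 concrete, here is a small sketch of summarizing one cell from its votes. The vote shape, the function `summarize_cell`, and its parameters are all hypothetical; the slides do not describe Factual's actual aggregation function.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

# Hypothetical vote shape: (user, proposed value).
Vote = Tuple[str, str]


def summarize_cell(
    votes: Iterable[Vote],
    owner: Optional[str] = None,
    owner_weight: float = 1.0,
    spammers: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Pick the value with the highest weighted vote total.

    Each parameter corresponds to one query change from the slide:
    spammers filters inputs, owner_weight boosts the table owner,
    and the max-by-total rule is the (swappable) aggregation function.
    """
    totals: Dict[str, float] = defaultdict(float)
    for user, value in votes:
        if user in spammers:
            continue  # drop spam inputs entirely
        totals[value] += owner_weight if user == owner else 1.0
    if not totals:
        return None
    return max(totals, key=lambda v: totals[v])
```

Changing any of these parameters changes the summarized value without any new input arriving, which is why such a change forces a re-cache of the whole table rather than a row-level invalidation.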
    23. Consistency Challenges
        Cache Instance Management
        • Simple Re-cache
            o Dump the current cached copy, and re-cache.
            o Slow
            o Poor user experience
    24. Consistency Challenges
        Cache Instance Management
        • Better solution: Double Buffering
            o Reload new version in background
            o Continue to serve current table
                - "closest match" warning
            o Allow switch-back
                - Continue to accept invalidations against old table
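
The double-buffering approach on slide 24 can be sketched as a table wrapper that keeps two versions and swaps between them. The class `DoubleBufferedTable` and its method names are hypothetical, and the rebuild runs inline here for simplicity, whereas the slide describes it running in the background.

```python
from typing import Callable, Dict, Optional


class DoubleBufferedTable:
    """Serve one cached version of a table while keeping another around."""

    def __init__(self, initial_rows: Dict[str, str]) -> None:
        self._active: Dict[str, str] = initial_rows
        self._previous: Optional[Dict[str, str]] = None

    def read(self, row_id: str) -> Optional[str]:
        """Reads always come from the active version."""
        return self._active.get(row_id)

    def reload(self, build: Callable[[], Dict[str, str]]) -> None:
        """Build the new version, then swap it in with one assignment.

        Until build() returns, read() keeps serving the current table.
        """
        new_rows = build()
        self._previous, self._active = self._active, new_rows

    def switch_back(self) -> None:
        """Return to the version that was active before the last reload."""
        if self._previous is None:
            raise RuntimeError("no previous version to switch back to")
        self._active, self._previous = self._previous, self._active

    def invalidate(self, row_id: str) -> None:
        """Apply an invalidation to both versions so either stays usable."""
        self._active.pop(row_id, None)
        if self._previous is not None:
            self._previous.pop(row_id, None)
```

Applying invalidations to the old version as well is what makes switch-back safe: the old table never serves a row that was invalidated while the new table was active.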
    25. Performance
        Encoding-compliant tablespaces
        • Support UTF-8, non-Latin sort orders
        Select tables get SSD-based PostgreSQL caching
        • See Jignesh Shah's terrific slides from PgEast 2009
        • http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
        • 20x improvement in random reads (IO pattern for unclustered index reads)
        • 2x improvement on sequential writes (generally pretty smooth)
    26. What's next?
        Encoding-compliant tablespaces
        • Support UTF-8, non-Latin sort orders
        Select tables get SSD-based PostgreSQL caching
        • See Jignesh Shah's terrific slides from PgEast 2009
        • http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
        • 20x improvement in random reads (IO pattern for unclustered index reads)
        • 2x improvement on sequential writes (generally pretty smooth)
    27. How can I use Factual?
        Web UI
        • Dataset Creation
        • Workbench
        http://www.factual.com/
        APIs
        • Server API: http://wiki.developer.factual.com/FrontPage
        • Visualizations: http://wiki.developer.factual.com/Factual-Visualization-Documentation
    28. Questions
    29. eric@factual.com
        Twitter: @factualinc
        http://blog.factual.com
