Factual presentation for pg west 2010

  • "built" on living data?
  • Look for data; create a new table; sort + search it
  • Alternate: roll our own solution. Persistence: application/server restarts
  • Other attributes: number of inputs, level of consensus, etc.
  • Add new column; add new row; give inputs; do a merge

Presentation Transcript

  • Factual: Eric Lui, Software Engineer, Data Storage, eric@factual.com
  • What is Factual.com? Factual is a platform for sharing, mashing, and publishing open data.
  • Crowd-Sourced Data
    • … is terrific!
      • Verifiable
      • Vote-driven
      • Customizable
  • Demo
  • Data Storage
    • Goal:
      • 10M tables
      • 1B rows (summarized)
      • 10B inputs (or "votes")
    • Raw storage
      • 1TB per input server
      • 100MB+ per dataset
  • What does all this "scale" mean?
    • Map-Reduce is the right architecture for us:
    • High volume storage
    • Scales (with the right design)
    • Shards and partitions in-place
    • Minimal downtime
    • Throwaway intermediary stages
  • What does all this "scale" mean?
    • Hard to profile
    • Hard to predict what table will get "hot"
    • Performance tuning has to be general, unless we're on a Service Level Agreement and can devote DBA resources (not our core strength)
    • Map-Reduce is not real time
  • Data Storage
    • Challenges
      • Summarization operations are memory-intensive
      • N-way merging is expensive (i.e., slow)
      • Streaming is necessary to serve back full summaries
      • Common use case is just the first N rows
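The last two challenges fit together: if the per-source rows are already sorted, the merge can be streamed and stopped after the first N rows. A minimal sketch follows, assuming sorted row streams; the shard names and the `first_n` helper are illustrative and not taken from the slides.

```python
import heapq
from itertools import islice

def merged_rows(sorted_sources, key=None):
    """Lazily merge already-sorted row streams into one sorted stream."""
    return heapq.merge(*sorted_sources, key=key)

def first_n(sorted_sources, n, key=None):
    """Return only the first n rows without materializing the full merge."""
    return list(islice(merged_rows(sorted_sources, key=key), n))

# Hypothetical shards, each sorted by row id.
shard_a = [(1, "apple"), (4, "date"), (7, "grape")]
shard_b = [(2, "banana"), (5, "elderberry")]
shard_c = [(3, "cherry"), (6, "fig")]

top3 = first_n([shard_a, shard_b, shard_c], 3, key=lambda row: row[0])
print(top3)
```

Because `heapq.merge` is lazy, serving the common "first N rows" case touches only a few rows from each source, while a full summary can still be streamed from the same generator.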
  • Emerging Patterns
      • Many Reads
      • (Relatively) Few New rows
      • (Very) Few row Updates
      • Infrequent (< 1 per day) table-wide re-summarizations
  • High Availability
    • Votestore
      • 3x Redundancy
  • High Availability
    • Problem: Summarization is slow.
    • Solution: Build a caching layer.
    • Cache
      • 3x Replication
      • "Dumb" load balancing
      • Server Affinity (via Zookeeper)
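The slides do not say how affinity is computed, so the following is only a sketch of one way to combine 3x replication, "dumb" balancing, and server affinity. It uses rendezvous hashing, which is a technique swapped in for illustration. The plain `servers` list stands in for the live membership that would presumably come from ZooKeeper, and all names here are hypothetical.

```python
import hashlib
import random

def replicas_for(table_id, servers, copies=3):
    """Pick `copies` servers for a table with rendezvous (highest-score) hashing."""
    def score(server):
        digest = hashlib.sha256(f"{server}:{table_id}".encode()).hexdigest()
        return int(digest, 16)
    return sorted(servers, key=score, reverse=True)[:copies]

def pick_server(table_id, servers, rng=random):
    """'Dumb' balancing: choose uniformly among the table's replicas."""
    return rng.choice(replicas_for(table_id, servers))

# Hypothetical membership list (an assumption; not from the slides).
servers = ["cache-1", "cache-2", "cache-3", "cache-4", "cache-5"]
chosen = replicas_for("table-42", servers)
target = pick_server("table-42", servers)
print(chosen, target)
```

With this scheme, requests for the same table keep landing on the same three servers, and removing a server that does not hold the table leaves its assignment unchanged.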
  • Metaphor Shear
    • Why PostgreSQL?
    • Pros
      • End-user expectations map to RDBMS world
      • Indexing on common operations 
        • (ORDER BY, WHERE)
      • Full-text search
      • Latitude/longitude/geo functions with PostGIS
      • Aggregation on summarized results
      • Built-in persistence
  • Metaphor Shear
    • Why PostgreSQL?
    • Cons
      • No built-in "versioning"
      • Re-summarization, though infrequent, is expensive
      • Need to map lisp-based query language to SQL
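The slides do not show the query language, so the sketch below invents a tiny prefix-notation filter, written as nested tuples, purely to illustrate what mapping a Lisp-style expression to SQL can look like. The operator set, the `to_sql` name, and the `%s` placeholder style (as used by common PostgreSQL drivers) are all assumptions.

```python
COMPARISONS = {"=", "<", ">", "<=", ">="}

def to_sql(expr, columns):
    """Translate a nested prefix expression into (where_clause, params)."""
    op, *args = expr
    if op in ("and", "or"):
        parts, params = [], []
        for sub in args:
            sql, sub_params = to_sql(sub, columns)
            parts.append(f"({sql})")
            params.extend(sub_params)
        return f" {op.upper()} ".join(parts), params
    if op in COMPARISONS:
        column, value = args
        if column not in columns:
            # Only known column names are ever interpolated into the SQL text.
            raise ValueError(f"unknown column: {column}")
        return f'"{column}" {op} %s', [value]
    raise ValueError(f"unsupported operator: {op}")

# Hypothetical query: (and (= city "Paris") (> population 1000))
query = ("and", ("=", "city", "Paris"), (">", "population", 1000))
where, params = to_sql(query, columns={"city", "population"})
print(where, params)
```

Values travel as parameters rather than being spliced into the string, so the translation layer does not have to handle quoting itself.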
  • High Availability
    • Why PostgreSQL?
    • Other considerations
      • Must proactively store attributes
      • Schema changes are expensive
      • Handling "upsert" operations is awkward
      • Deletes are difficult (but infrequent)
        • (related) No concept of row merge
  • Demo
  • Cache Consistency
    • ACID? Not really...
    • High concurrency favored over database-style transactions
    • Eventually consistent
  • Consistency Challenges
    • Cache Invalidation
      • How do I handle new inputs?
        • Shield the Input Store
          • Low-priority - shield the input store
          • Row-level invalidations
        • Lazy 
          • Fetch updated rows on summary request 
          • Leverage postgres to track invalidations
        • Decouple From Input API call
          • Async notification
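The lazy, row-level approach above can be sketched as follows. This is an in-memory stand-in under stated assumptions: the slides track invalidations in PostgreSQL, while this sketch uses a Python set, and `LazySummaryCache` and `slow_summarize` are hypothetical names.

```python
class LazySummaryCache:
    """Row-level cache that recomputes a row only when it is next requested."""

    def __init__(self, summarize_row):
        self._summarize_row = summarize_row  # slow call into the input store
        self._rows = {}                      # row_id -> cached summary
        self._stale = set()                  # row_ids invalidated by new inputs

    def notify_input(self, row_id):
        """Async notification hook: mark the row stale, do no other work."""
        self._stale.add(row_id)

    def get_summary(self, row_ids):
        """Serve rows, refreshing only those that are missing or stale."""
        for row_id in row_ids:
            if row_id in self._stale or row_id not in self._rows:
                self._rows[row_id] = self._summarize_row(row_id)
                self._stale.discard(row_id)
        return [self._rows[row_id] for row_id in row_ids]

calls = []

def slow_summarize(row_id):
    # Stand-in for the expensive summarization against the input store.
    calls.append(row_id)
    return {"row": row_id, "version": calls.count(row_id)}

cache = LazySummaryCache(slow_summarize)
cache.get_summary([1, 2])              # first read fills both rows
cache.notify_input(2)                  # a new input arrives for row 2
refreshed = cache.get_summary([1, 2])  # only row 2 is recomputed
print(calls, refreshed)
```

The input API call only records the invalidation, so the input store is touched again only when someone asks for the affected row.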
  • Consistency Challenges
    • Cache Instance Management
      • How do we handle query changes? 
        • filtering out spam inputs
        • change the aggregation function
        • give more weight to table owner's votes
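To see why a query change forces a re-summarization, consider a minimal vote tally in which spam filtering and owner weighting are parameters. The function name, the weight of 3.0, and the sample inputs are assumptions made for illustration; the slides do not describe the actual aggregation.

```python
from collections import defaultdict

def summarize_cell(inputs, owner, spammers=frozenset(), owner_weight=3.0):
    """Pick the winning value for one cell from (user, value) inputs."""
    tally = defaultdict(float)
    for user, value in inputs:
        if user in spammers:
            continue  # filter out spam inputs
        tally[value] += owner_weight if user == owner else 1.0
    if not tally:
        return None
    return max(tally, key=tally.get)

# Hypothetical inputs for a single cell.
inputs = [
    ("alice", "Paris"),
    ("bob", "Lyon"),
    ("carol", "Lyon"),
    ("spam1", "Nice"),
    ("spam2", "Nice"),
    ("spam3", "Nice"),
]

unfiltered = summarize_cell(inputs, owner=None)
filtered = summarize_cell(inputs, owner=None, spammers={"spam1", "spam2", "spam3"})
owner_weighted = summarize_cell(inputs, owner="alice", spammers={"spam1", "spam2", "spam3"})
print(unfiltered, filtered, owner_weighted)
```

The same inputs yield three different summaries depending on the parameters, which is why each such change needs its own cached instance.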
  • Consistency Challenges
    • Cache Instance Management
      • Simple Re-cache
        • Dump the current cached copy, and re-cache.
        • Slow
        • Poor user experience
  • Consistency Challenges
    • Cache Instance Management
      • Better solution: Double Buffering
        • Reload new version in background
        • Continue to serve current table 
          • "closest match" warning
        • Allow switch-back
          • Continue to accept invalidations against old table
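A minimal sketch of the double-buffering idea, assuming an in-process table and a hypothetical `build` function that performs the slow re-cache. It covers background reload, the "closest match" flag, and switch-back; forwarding invalidations to the old table is left out.

```python
import threading

class DoubleBufferedTable:
    """Serve the current cached table while a new version loads in the background."""

    def __init__(self, build, query):
        self._build = build                    # slow: re-caches a table for a query
        self._lock = threading.Lock()
        self._active = (query, build(query))
        self._previous = None

    def read(self, requested_query):
        """Return (rows, closest_match); the flag is True if rows come from another query."""
        with self._lock:
            query, rows = self._active
        return rows, query != requested_query

    def reload(self, new_query):
        """Build the new version outside the lock, then swap it in."""
        rows = self._build(new_query)
        with self._lock:
            self._previous = self._active
            self._active = (new_query, rows)

    def switch_back(self):
        """Swap the previous version back in, if there is one."""
        with self._lock:
            if self._previous is not None:
                self._active, self._previous = self._previous, self._active

def build(query):
    # Stand-in for the expensive re-cache step.
    return [f"{query}-row"]

table = DoubleBufferedTable(build, "v1")
before = table.read("v2")                 # old table served with a warning flag
worker = threading.Thread(target=table.reload, args=("v2",))
worker.start()
worker.join()
after = table.read("v2")                  # new table, exact match
table.switch_back()
reverted = table.read("v2")               # old table again
print(before, after, reverted)
```

Readers are never blocked by the rebuild itself, since the lock is held only for the pointer swap.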
  • Performance
    • Encoding-compliant tablespaces
    • Support UTF-8, non-Latin sort orders
    • Select Tables get SSD-based PostgreSQL caching
    • See Jignesh Shah's terrific slides from PgEast 2009
    • http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
    • 20x improvement in random reads (IO pattern for unclustered index reads)
    • 2x improvement on sequential writes (generally pretty smooth)
  • What's next?
    • Encoding-compliant tablespaces
    • Support UTF-8, non-Latin sort orders
    • Select Tables get SSD-based PostgreSQL caching
    • See Jignesh Shah's terrific slides from PgEast 2009
    • http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on
    • 20x improvement in random reads (IO pattern for unclustered index reads)
    • 2x improvement on sequential writes (generally pretty smooth)
  • How can I use Factual?
    • Web UI 
      • Dataset Creation
      • Workbench http://www.factual.com/
    • APIs
      • Server API http://wiki.developer.factual.com/FrontPage
      • Visualizations http://wiki.developer.factual.com/Factual-Visualization-Documentation
  • Questions  
  • [email_address] Twitter: @factualinc http://blog.factual.com