Hypercubes in HBase

Transcript

  • 1. Hypercubes in HBase Fredrik Möllerstrand <fredrik@last.fm> Hadoop User Group UK, April 14 2009
  • 2. Hello. • Per Andersson, Fredrik Möllerstrand • Chalmers University of Technology, Sweden • Master thesis at last.fm
  • 3. stats.last.fm • Web statistics for in-house use. • Served out of MySQL.
  • 4. stats.last.fm • y-axis: pageviews. • x-axis: time. • also: countries.
  • 5. stats.last.fm SELECT country, SUM(pageviews) FROM webstatistics GROUP BY country;
  • 6. SQL: Star schema • Facts table & dimension tables. • Joins!
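To make the star-schema slide concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (fact_pageviews, dim_country, country_id) are invented for illustration and are not the schema behind stats.last.fm. It shows that even a per-country total needs a join between the facts table and a dimension table.

```python
import sqlite3

# In-memory database; the schema below is hypothetical, not last.fm's.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_pageviews (
        day TEXT,
        country_id INTEGER REFERENCES dim_country(id),
        pageviews INTEGER
    );
    INSERT INTO dim_country VALUES (1, 'US'), (2, 'NO');
    INSERT INTO fact_pageviews VALUES
        ('2009-04-14', 1, 3),
        ('2009-04-14', 1, 2),
        ('2009-04-14', 2, 1);
""")

# The facts table holds only keys and measures, so reading a country
# name back out requires joining against the dimension table.
rows = conn.execute("""
    SELECT c.name, SUM(f.pageviews)
    FROM fact_pageviews AS f
    JOIN dim_country AS c ON c.id = f.country_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)
```

Each further dimension (useragent, domain, and so on) would add another dimension table and another join to the query.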
  • 7. Data Cube Foundations • n-dimensional cube • attribute => dimension • attribute value => measurement
  • 8. Data Cube Foundations • dimensionality reductions • projections
  • 9. Data Cube Foundations • Aggregation: sum, count, average, &c. • Data cubes are modeled as star schemas in RDBMSs and with column-families in HBase.
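A projection with sum aggregation can be sketched in a few lines of Python. This is an illustration of the idea only, assuming facts arrive as plain dicts; the function name project and the sample data are made up for the example.

```python
from collections import defaultdict


def project(facts, dims, unit):
    """Project the cube onto `dims`, summing `unit` over all other dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        # The point in the reduced space is the tuple of kept dimension values.
        point = tuple(fact[d] for d in dims)
        totals[point] += fact[unit]
    return dict(totals)


# Hypothetical sample facts in a two-dimensional cube.
facts = [
    {"country": "US", "useragent": "safari", "pageviews": 3},
    {"country": "US", "useragent": "firefox", "pageviews": 2},
    {"country": "NO", "useragent": "opera", "pageviews": 1},
]

by_country = project(facts, ["country"], "pageviews")
grand_total = project(facts, [], "pageviews")
print(by_country, grand_total)
```

Projecting onto an empty list of dimensions collapses the cube to a single grand total; projecting onto all dimensions leaves it unchanged.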
  • 10. Data Cubes in HBase • Store projections, not distinct dimensions. • Pre-compute *everything*.
  • 11. Data Cubes in HBase • Rowkey: unit + time, e.g. ‘pageviews-20090414’ • One column-family for every projection, e.g. ‘country-useragent’ • One qualifier per point in n-space, e.g. ‘US-safari’, ‘NO-opera’, &c.
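The naming scheme on this slide can be written down as a small helper. This is a sketch that follows the slide's examples, not code taken from Zohmg; the function name cell_coordinates and the daily granularity of the rowkey are assumptions.

```python
from datetime import date


def cell_coordinates(unit, day, projection, point):
    """Return (rowkey, column-family, qualifier) for one point of one projection."""
    rowkey = "%s-%s" % (unit, day.strftime("%Y%m%d"))      # unit + time
    family = "-".join(projection)                          # one family per projection
    qualifier = "-".join(point[d] for d in projection)     # one qualifier per point
    return rowkey, family, qualifier


coords = cell_coordinates(
    "pageviews",
    date(2009, 4, 14),
    ["country", "useragent"],
    {"country": "US", "useragent": "safari"},
)
print(coords)
```

Joining values with a hyphen is ambiguous if a dimension value itself contains a hyphen, so a real implementation would need an escaping rule or a different separator.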
  • 12. The SQL-DB Problem • Too much data to keep in memory. • Many joins make queries complex. • Can’t serve at mouse-click rate.
  • 13. The Solution: A data store that is: • Distributed • Multi-dimensional • Magnetic(!) • Just general enough
  • 14. Enter: Zohmg.
  • 15. Zohmg: A data store that is: • Distributed • Multi-dimensional • Time-series-based • Magnetic(!) • Just general enough
  • 16. Tech • Rides on the back of Dumbo. • Stores aggregates in HBase. • Serves JSON.
  • 17. Zohmg • $> setup.py # create hbase database. • $> import.py --mapper weblogs.py # run dumbo job. • $> serve.py # start web server.
  • 18. Developers, developers. • Configuration: YAML. • Mapper: Python.
  • 19. User’s configuration.
      project_name: webmetrics
      dimensions:
        - country
        - domain
        - useragent
        - usertype
      units:
        - pageviews
      projections:
        country:
          - country
        domain-usertype:
          - domain
          - usertype
        country-domain-useragent-usertype:
          - country
          - domain
          - useragent
          - usertype
  • 20. User’s mapper.
      def map(key, value):
          from lfm.data.parse import web
          log = web.parse(value)
          # One value per configured dimension.
          dimensions = {'country'   : geoip(log.host),
                        'domain'    : log.domain,
                        'useragent' : classify(log.useragent),
                        'usertype'  : ("user", "anon")[log.userid is None]
                       }
          # Each log line counts as one pageview.
          values = {'pageviews' : 1}
          yield log.timestamp, dimensions, values
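How the mapper's output might be turned into pre-computed aggregates is not shown in the deck, so the following is only a guess at the shape of that step. It copies the projections from slide 19 into a dict, assumes log.timestamp is a datetime, and uses the rowkey, column-family and qualifier naming from slide 11. The names expand and aggregate are hypothetical, not Zohmg's.

```python
from collections import Counter
from datetime import datetime

# Projections from the configuration on slide 19, written as a dict.
PROJECTIONS = {
    "country": ["country"],
    "domain-usertype": ["domain", "usertype"],
    "country-domain-useragent-usertype":
        ["country", "domain", "useragent", "usertype"],
}


def expand(timestamp, dimensions, values, projections=PROJECTIONS):
    """Yield ((rowkey, family, qualifier), amount) for every projection and unit."""
    day = timestamp.strftime("%Y%m%d")
    for unit, amount in values.items():
        rowkey = "%s-%s" % (unit, day)
        for family, dims in projections.items():
            qualifier = "-".join(dimensions[d] for d in dims)
            yield (rowkey, family, qualifier), amount


def aggregate(records):
    """Sum the amounts per cell, standing in for the reduce step."""
    totals = Counter()
    for timestamp, dimensions, values in records:
        for cell, amount in expand(timestamp, dimensions, values):
            totals[cell] += amount
    return totals


# Two made-up mapper outputs for the same day.
records = [
    (datetime(2009, 4, 14, 9, 30),
     {"country": "US", "domain": "last.fm",
      "useragent": "firefox", "usertype": "user"},
     {"pageviews": 1}),
    (datetime(2009, 4, 14, 10, 0),
     {"country": "US", "domain": "last.fm",
      "useragent": "safari", "usertype": "anon"},
     {"pageviews": 1}),
]

totals = aggregate(records)
for cell, amount in sorted(totals.items()):
    print(cell, amount)
```

Every log line fans out into one increment per projection and unit, which is the cost of pre-computing everything: writes multiply, while a read becomes a single cell lookup.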
  • 21. Example.
  • 22. Dimensions in HBase • Column-family: country-useragent-domain • Qualifier: US-firefox-last.fm
  • 23. Questions?