Data Warehousing 101 (and a video)

Published in: Technology

  1. Data Warehousing 101: Everything you never wanted to know about big databases but were forced to find out anyway. Josh Berkus, Open Source Bridge 2011
  2. contents
     covering:
     ● concepts of DW
     ● some DW techniques
     ● databases
     not covering:
     ● hardware
     ● analytics/reporting tools
  3. BIG DATA
  4. 1970
  5. What is a “data warehouse”?
  6. Big Data?
  7. OLTP vs DW
     OLTP:
     ● many single-row writes
     ● current data
     ● queries generated by user activity
     ● < 1s response times
     ● 1000s of users
     DW:
     ● few large batch imports
     ● years of data
     ● queries generated by large reports
     ● queries can run for minutes/hours
     ● 10s of users
  8. OLTP vs DW
     OLTP: big data for many concurrent requests to very small amounts of data each
     DW: big data for low concurrency requests to large amounts of data each
  9. synonyms & subclasses
  10. archiving
  11. archiving
      WORN data: “write once, read never”
      ● grows indefinitely
      ● usually a result of regulatory compliance
      ● main concern: storage efficiency
  12. data mining
  13. data mining
      the database where you don't know what's in there, but you want to find out
      ● lots of data (TB to PB)
      ● mostly “semi-structured”
      ● data produced as a side effect of other business processes
      ● needs CPU-intensive processing
  14. BI: Business Intelligence
      DSS: Decision Support
      OLAP: Online Analytical Processing
      Analytics
  15. BI/DSS/OLAP/Analytics
  16. BI/DSS/OLAP/Analytics
      databases which support visualization of large amounts of data
      ● data is fairly well understood
      ● most data can be reduced to categories, geography, and taxonomy
      ● primarily about indexing
  17. What is a “dimension”?
  18. dimensions vs. facts
      [diagram: a central Fact Table linked to dimensions: customers / accounts, and category → subcategory → sub-subcategory]
  19. dimension examples
      ● location/region/country/quadrant
      ● product categorization
      ● URL
      ● transaction type
      ● account hierarchy
      ● IP address
      ● OS/version/build
  20. dimension synonyms
      ● facet
      ● taxonomy
      ● secondary index
      ● view
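
As a rough illustration of the dimension/fact split in the slides above, the sketch below keeps a small fact table whose rows point at a dimension table by key, then totals the facts at one level of the dimension. The table contents and the names `categories`, `sales`, and `totalBy` are invented for illustration and do not come from the talk.

```javascript
// Hypothetical dimension table: one row per product category path.
const categories = [
  { id: 1, category: "hardware", subcategory: "disks" },
  { id: 2, category: "hardware", subcategory: "memory" },
  { id: 3, category: "software", subcategory: "databases" },
];

// Hypothetical fact table: each row is a measurement plus a dimension key.
const sales = [
  { categoryId: 1, amount: 120 },
  { categoryId: 2, amount: 80 },
  { categoryId: 3, amount: 200 },
  { categoryId: 1, amount: 30 },
];

// Sum the facts grouped by one attribute ("level") of the dimension.
function totalBy(level) {
  const byId = new Map(categories.map((c) => [c.id, c]));
  const totals = {};
  for (const row of sales) {
    const key = byId.get(row.categoryId)[level];
    totals[key] = (totals[key] || 0) + row.amount;
  }
  return totals;
}

console.log(totalBy("category"));
console.log(totalBy("subcategory"));
```

The same facts can be summarized at either level of the category dimension without changing the fact rows, which is the point of keeping the dimension separate.
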
  21. What is ETL?
  22. Extract, Transform, Load
      ● how you turn external raw data into useful database data
        ● Apache logs → web analytics DB
        ● CSV POS files → financial reporting DB
        ● OLTP server → 10-year data warehouse
      ● also called ELT when the transformation is done inside the database
  23. Purpose of ETL/ELT: getting data into the data warehouse
      ● clean up garbage data
      ● split out attributes
      ● “normalize” dimensional data
      ● deduplication
      ● calculate materialized views / indexes
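
The transform steps listed above can be sketched on a few CSV-like lines. This is a minimal sketch, not a real ETL tool: the line format (`timestamp,store,sku,amount`), the sample data, and the function name `transform` are all assumptions made up for illustration.

```javascript
// Hypothetical raw point-of-sale lines: "timestamp,store,sku,amount".
const rawLines = [
  "2011-06-21T10:00:00Z,PDX-01,sku-42,19.99",
  "2011-06-21T10:00:00Z,PDX-01,sku-42,19.99", // exact duplicate
  "garbage line",                             // malformed
  "2011-06-21T10:05:00Z,pdx-01 ,sku-7,5.00",  // messy store code
];

function transform(lines) {
  const seen = new Set();
  const rows = [];
  for (const line of lines) {
    const parts = line.split(",");
    if (parts.length !== 4) continue; // clean up garbage data
    const [ts, store, sku, amount] = parts.map((p) => p.trim());
    const value = Number(amount);
    if (Number.isNaN(Date.parse(ts)) || Number.isNaN(value)) continue;
    const storeCode = store.toUpperCase();       // "normalize" dimensional data
    const [city, branch] = storeCode.split("-"); // split out attributes
    const key = [ts, storeCode, sku, value].join("|");
    if (seen.has(key)) continue;                 // deduplication
    seen.add(key);
    rows.push({ ts, city, branch, sku, amount: value });
  }
  return rows;
}

const cleaned = transform(rawLines);
console.log(cleaned);
```
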
  24. ETL Tools: K.E.T.T.L.E.
  25. ETL Tools
  26. Ad-hoc scripting
  27. ELT Tips: think volume
      ● bulk processing or parallel processing
        ● no row-at-a-time, document-at-a-time
      ● insert into permanent storage should be the last step
        ● no updates
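
One way to picture the "think volume" tips is to transform the whole input first and only then append it to storage in batches. In the sketch below the in-memory `warehouse` array stands in for permanent storage, and `toBatches`, `appendBatch`, and `load` are hypothetical helpers, not part of any real loader.

```javascript
// Split an array of rows into fixed-size batches.
function toBatches(rows, batchSize) {
  const batches = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    batches.push(rows.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical stand-in for permanent storage; it only ever receives appends.
const warehouse = [];
let writeCalls = 0;
function appendBatch(batch) {
  warehouse.push(...batch);
  writeCalls += 1;
}

// Transform everything first, then append as the final step (no updates).
function load(rawRows, batchSize) {
  const transformed = rawRows.map((r) => ({ id: r.id, total: r.qty * r.price }));
  for (const batch of toBatches(transformed, batchSize)) {
    appendBatch(batch);
  }
}

const raw = Array.from({ length: 10 }, (_, i) => ({ id: i, qty: 2, price: i }));
load(raw, 4);
console.log(writeCalls, warehouse.length);
```

Ten rows with a batch size of four reach storage in three writes rather than ten, and no stored row is modified after it lands.
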
  28. Queues not Extract
  29. What kind of database should I use for DW?
  30. 5 Types
      1. Standard Relational
      2. MPP
      3. Column Store
      4. Map/Reduce
      5. Enterprise Search
  31. standard relational
  32. standard relational
      the all-purpose solution for not-that-big data
      ● adequate for all tasks
        ● but not excellent at any of them
      ● easy to use
        ● low resource requirements
        ● well-supported by all software
        ● familiar
      ● not suitable for really big data
  33. Sweet Spots
      [chart: MySQL, PostgreSQL, and DW Database ranges on an axis from 0 to 30]
  34. What's MPP?
  35. Massively Parallel Processing
  36. appliance / software
  37. MPP
      cpu-intensive data warehousing
      ● data mining, some analytics
      ● supporting complex query logic
      ● moderately big data (1-200TB)
      ● drawbacks: proprietary, expensive
      ● now hybridizes with other types
  38. What's a column store?
  39. column store
  40. column store
      inversion of a row store:
      indexes become data,
      data becomes indexes
  41. column stores
  42. column stores
      for aggregations and transformations of highly structured data
      ● good for BI, analytics, some archiving
      ● moderately big data (0.5-100TB)
      ● bad for data mining
      ● slow to add new data / purge data
      ● usually support compression
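
To make the row-store versus column-store contrast concrete, here is a toy sketch that pivots a few rows into per-attribute arrays and aggregates one of them. Real column stores are far more elaborate; the sample data and the helpers `toColumns`, `sumColumn`, and `runLengthEncode` are assumptions for illustration, and run-length encoding is shown only as one simple example of why columns tend to compress well.

```javascript
// Row-oriented layout: one object per record.
const rowStore = [
  { region: "west", product: "a", amount: 10 },
  { region: "east", product: "b", amount: 25 },
  { region: "west", product: "b", amount: 5 },
];

// Convert to a column-oriented layout: one array per attribute.
function toColumns(rows) {
  const columns = {};
  for (const row of rows) {
    for (const [name, value] of Object.entries(row)) {
      (columns[name] = columns[name] || []).push(value);
    }
  }
  return columns;
}

const columnStore = toColumns(rowStore);

// An aggregate over one attribute only has to read that attribute's array.
function sumColumn(columns, name) {
  return columns[name].reduce((acc, v) => acc + v, 0);
}

// Run-length encoding: repeated neighbouring values collapse to [value, count].
function runLengthEncode(values) {
  const runs = [];
  for (const v of values) {
    const last = runs[runs.length - 1];
    if (last && last[0] === v) last[1] += 1;
    else runs.push([v, 1]);
  }
  return runs;
}

console.log(sumColumn(columnStore, "amount"));
console.log(runLengthEncode(["west", "west", "east"]));
```

The sketch also hints at the slide's drawback: adding one new record means appending to every column array rather than writing a single row.
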
  43. What's map/reduce?
  44. map/reduce
  45. map/reduce
  46. map/reduce
      // map: emit one key per link of each document
      function(doc) {
        for (var i in doc.links) {
          emit([doc.parent, i], null);
        }
      }

      // reduce: nothing to aggregate
      function(keys, values) {
        return null;
      }
  47. map/reduce
      // Map
      function (doc) {
        emit(doc.val, doc.val);
      }

      // Reduce
      function (keys, values, rereduce) {
        // This computes the standard deviation of the mapped results
        var stdDeviation = 0.0;
        var count = 0;
        var total = 0.0;
        var sqrTotal = 0.0;

        if (!rereduce) {
          // This is the reduce phase, we are reducing over emitted values from
          // the map functions.
          for (var i in values) {
            total = total + values[i];
            sqrTotal = sqrTotal + (values[i] * values[i]);
          }
          count = values.length;
        } else {
          // This is the rereduce phase, we are re-reducing previously
          // reduced values.
          for (var i in values) {
            count = count + values[i].count;
            total = total + values[i].total;
            sqrTotal = sqrTotal + values[i].sqrTotal;
          }
        }

        var variance = (sqrTotal - ((total * total) / count)) / count;
        stdDeviation = Math.sqrt(variance);

        // the reduce result. It contains enough information to be rereduced
        // with other reduce results.
        return {"stdDeviation": stdDeviation, "count": count,
                "total": total, "sqrTotal": sqrTotal};
      }
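
To see why the reduce result carries `count`, `total`, and `sqrTotal`, the sketch below runs the same logic outside CouchDB. The function name `reduceStdDev`, the sample values, and the way the input is split into two chunks are assumptions for illustration; a real database decides for itself how to partition the data.

```javascript
// Same logic as the slide's reduce function, given a name so it can be called
// directly. The name is hypothetical.
function reduceStdDev(keys, values, rereduce) {
  let count = 0;
  let total = 0.0;
  let sqrTotal = 0.0;
  if (!rereduce) {
    // Reduce phase: values are raw numbers emitted by the map function.
    for (const v of values) {
      total += v;
      sqrTotal += v * v;
    }
    count = values.length;
  } else {
    // Rereduce phase: values are earlier reduce results.
    for (const r of values) {
      count += r.count;
      total += r.total;
      sqrTotal += r.sqrTotal;
    }
  }
  const variance = (sqrTotal - (total * total) / count) / count;
  return { stdDeviation: Math.sqrt(variance), count, total, sqrTotal };
}

const sample = [2, 4, 4, 4, 5, 5, 7, 9];

// Pretend the data was reduced in two separate chunks, then combined.
const partA = reduceStdDev(null, sample.slice(0, 3), false);
const partB = reduceStdDev(null, sample.slice(3), false);
const combined = reduceStdDev(null, [partA, partB], true);

// Reducing everything in one pass gives the same answer.
const direct = reduceStdDev(null, sample, false);
console.log(combined.stdDeviation, direct.stdDeviation);
```

The partial standard deviations cannot be averaged into the final one, but the three running sums can simply be added, which is why they travel with every intermediate result.
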
  48. map/reduce vs. MPP
      map/reduce:
      ● open source
      ● petabytes
      ● write routines by hand
      ● inefficient
      ● generic
      ● cheap HW / cloud
      ● DIY tools
      MPP:
      ● proprietary
      ● terabytes
      ● advanced query support
      ● efficient
      ● specific
      ● needs good HW
      ● integrated tools
  49. What's enterprise search?
  50. enterprise search: ElasticSearch
  51. enterprise search
      when you need to do DW with a huge pile of partly processed “documents”
      ● does: light data mining, light BI/analytics
        ● best “full text” and keyword search
      ● supports “approximate results”
      ● lots of special features for web data
  52. E.S. vs. C-Store
      E.S.:
      ● batch load
      ● semi-structured data
      ● uncompressed
      ● star schema
      ● sharding
      ● approximate results
      C-Store:
      ● batch load
      ● fully normalized data
      ● compressed
      ● snowflake schema
      ● parallel query
      ● exact results
  53. What's a windowing query?
  54. regular aggregate
  55. windowing function
  56. CREATE TABLE events (
        event_id INT,
        event_type TEXT,
        start TIMESTAMPTZ,
        duration INTERVAL,
        event_desc TEXT
      );
  57. SELECT MAX(concurrent)
      FROM (
        SELECT SUM(tally) OVER (ORDER BY start) AS concurrent
        FROM (
          SELECT start, 1::INT AS tally FROM events
          UNION ALL
          SELECT (start + duration), -1 FROM events
        ) AS event_vert
      ) AS ec;
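
The query above finds the peak number of simultaneous events by turning each event into a +1 row at its start and a -1 row at its end, then taking a running sum in time order. The sketch below walks through the same idea outside the database. The sample events (times as plain minutes) and the name `maxConcurrent` are assumptions for illustration. Sorting an end before a start at the same instant is my choice for the loop; for the maximum it matches the SQL, whose default window frame sums all rows sharing a timestamp together.

```javascript
// Hypothetical events: start and duration in minutes.
const events = [
  { start: 0, duration: 30 },
  { start: 10, duration: 30 },
  { start: 20, duration: 5 },
  { start: 50, duration: 10 },
];

function maxConcurrent(evts) {
  // One +1 row per start and one -1 row per end (the UNION ALL subquery).
  const points = [];
  for (const e of evts) {
    points.push({ at: e.start, tally: 1 });
    points.push({ at: e.start + e.duration, tally: -1 });
  }
  // Order by time; at equal times, process ends (-1) before starts (+1).
  points.sort((a, b) => a.at - b.at || a.tally - b.tally);

  // Running sum (SUM(tally) OVER (ORDER BY start)), tracking its maximum.
  let running = 0;
  let max = 0;
  for (const p of points) {
    running += p.tally;
    max = Math.max(max, running);
  }
  return max;
}

console.log(maxConcurrent(events)); // 3: the first three events overlap at minute 20
```
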
  58. stream processing SQL
      ● replace multiple queries with a single query
        ● avoid scanning large tables multiple times
      ● replace pages of application code
        ● and MB of data transmission
      ● SQL alternative to map/reduce
        ● (for some data mining tasks)
  59. What's a materialized view?
  60. query results as table
      ● calculate once, read many times
        ● complex/expensive queries
        ● frequently referenced
      ● not necessarily a whole query
        ● often part of a query
      ● might be manually or automatically updated
        ● depends on product
  61. non-relational matviews
      ● CouchDB Views
        ● cache results of map/reduce jobs
        ● updated on data read
      ● Solr / Elastic Search “Faceted Search”
        ● cached indexed results of complex searches
        ● updated on data change
  62. maintaining matviews
      BEST: update matviews at batch load time
      GOOD: update matview according to clock/calendar
      FAIR: update matview on data request
      BAD for DW: update matviews using a trigger
  63. matview tips
      ● matviews should be small
        ● 1/10 to ¼ of RAM on each node
      ● each matview should support several queries
        ● or one really really important one
      ● truncate + append, don't update
      ● index matviews like crazy
        ● if they are not indexes themselves
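
A toy version of the "update at batch load time" and "truncate + append" advice: the summary table is rebuilt from scratch once per batch instead of being patched row by row. The in-memory arrays stand in for database tables, and `pageViews`, `viewsPerDay`, `rebuildViewsPerDay`, and `loadBatch` are hypothetical names chosen for illustration.

```javascript
// Hypothetical base table of page views.
const pageViews = [
  { day: "2011-06-20", url: "/a" },
  { day: "2011-06-20", url: "/b" },
  { day: "2011-06-21", url: "/a" },
];

// The "matview": a small precomputed table of views per day.
let viewsPerDay = [];

function rebuildViewsPerDay() {
  const counts = new Map();
  for (const v of pageViews) {
    counts.set(v.day, (counts.get(v.day) || 0) + 1);
  }
  // Truncate + append: build fresh rows and swap them in, no in-place updates.
  viewsPerDay = Array.from(counts, ([day, views]) => ({ day, views }));
}

// Batch load: append the new base rows, then refresh the matview once.
function loadBatch(rows) {
  pageViews.push(...rows);
  rebuildViewsPerDay();
}

loadBatch([
  { day: "2011-06-21", url: "/c" },
  { day: "2011-06-22", url: "/a" },
]);
console.log(viewsPerDay);
```

Readers of `viewsPerDay` never trigger any computation; the cost is paid once per load, however many rows the batch contains.
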
  64. What's OLAP?
  65. cubes
      [diagram: cube with axes labeled Repeat Visitors, Site, and Browser]
  66. drill-down
  67. OLAP
      ● OnLine Analytical Processing
      ● Visualization technique
        ● all data as a multi-dimensional space
        ● great for decision support
      ● CPU & RAM intensive
        ● hard to do on really big data
      ● Works well with column stores
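
The cube and drill-down slides can be approximated in a few lines: treat each fact as a point in a space of dimensions, total the measure over whichever dimensions are of interest, then narrow to one value and break it out further. The sample data loosely follows the labels on the cube slide; the data itself and the helpers `rollUp` and `drillDown` are assumptions for illustration, not how an OLAP engine is implemented.

```javascript
// Hypothetical visit facts with three dimensions and one measure (count).
const visits = [
  { site: "blog", browser: "firefox", repeat: true, count: 40 },
  { site: "blog", browser: "chrome", repeat: false, count: 25 },
  { site: "shop", browser: "firefox", repeat: false, count: 10 },
  { site: "shop", browser: "chrome", repeat: true, count: 15 },
];

// Roll up: total the measure over the chosen dimensions.
function rollUp(facts, dims) {
  const cells = {};
  for (const f of facts) {
    const key = dims.map((d) => String(f[d])).join("/");
    cells[key] = (cells[key] || 0) + f.count;
  }
  return cells;
}

// Drill down: fix one dimension value, then break out by a finer dimension.
function drillDown(facts, dim, value, nextDim) {
  return rollUp(facts.filter((f) => f[dim] === value), [nextDim]);
}

console.log(rollUp(visits, ["site"]));
console.log(drillDown(visits, "site", "blog", "browser"));
console.log(rollUp(visits, ["site", "repeat"]));
```

Every extra dimension multiplies the number of cells to compute and hold, which is one way to read the slide's warning that OLAP is CPU and RAM intensive.
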
  68. Contact
      ● Josh Berkus: josh@pgexperts.com
        ● blog: blogs.ittoolbox.com/database/soup
        ● twitter: @fuzzychef
      ● PostgreSQL: www.postgresql.org
        ● pgexperts: www.pgexperts.com
      This talk is copyright 2011 Josh Berkus and is licensed under the Creative Commons Attribution license. Many images were taken from Google Images and are copyright their original creators, whom I don't actually know. Logos are trademark their respective owners, and are used here under fair use.