Column and Hadoop


My planned talk at HBTC, China's largest big data technology conference, covering columnar databases and related Hadoop topics.


  1. 1. Columnar Database and Hadoop. 江志伟 (Alex Jiang), 2012-12-1
  2. 2. Agenda: 1. Column Advantage; 2. Storage and Process; 3. Hadoop Related
  3. 3. History 2001 PAX Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, … C-Store: A Column Oriented DBMS D. J. Abadi, etc: Integrating Compression and Execution in Column-O riented Database Systems. In SIGMOD, pages 671–682, 2006. D. J. Abadi, etc: Materialization Strategies in a Column-Oriented DB MS. In ICDE, pages 466–475, 2007.
  4. 4. File Format: PAX; Columnar storage; (Columnar) compression; PPD vs Index or MV; SerDe
  5. 5. PAX (picture from Oracle blog)
  6. 6. Columnar Store vs Row Store ● IO-1 (basic column store): Every storage block contains data from only ONE column. ● IO-2: Aggressive compression. ● IO-3: No record-ids. ● CPU-4: A column executor. ● CPU-5: Executor runs on compressed data. ● CPU-6: Executor can process columns that are in key sequence or entry sequence.
  7. 7. Columnar Store advantage ● Compression: RLE, Bitmap, ... ● PPD reduces IO ● Late Materialization: less memory and CPU overhead ● Block Iteration (Vectorization): less CPU overhead ● Invisible Join: block as join key
  8. 8. Compression ● Run-length Encoding ● High Selectivity: ● ENCODING DELTAVAL Gender, age ● Bit Vector Encoding ● Mid Selectivity: ● BLOCK_DICT City, Category, data skew ● Low Selectivity: compound item_id, user_id, Price, quantity, comment
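
The run-length encoding named on the compression slide can be illustrated with a minimal sketch. It is not taken from the talk: the function names and the sample `gender` column are assumptions made for illustration, and the sketch assumes the column is sorted so that equal values sit next to each other.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical sorted column with few distinct values and long runs.
gender = ["F"] * 4 + ["M"] * 3
encoded = rle_encode(gender)
decoded = rle_decode(encoded)
```

Seven stored values shrink to two pairs here. On an unsorted column with many distinct values the runs would be short and the encoding would save little, which is why the slide pairs each encoding with a kind of column.
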
  9. 9. Column File Format (picture from Vertica blog)
  10. 10. PPD: Predicate Push Down; Continuous IO; Compound Predicate; Max-Min in each minor block; PAX has PPD but it is not efficient
  11. 11. PPD (picture from Vertica blog)
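
The "Max-Min in each minor block" idea from the PPD slides can be sketched as follows. This is only an illustration under stated assumptions: the `Block` class, the `scan_greater_than` function, and the sample values are hypothetical, and a real column store keeps this metadata in its file format rather than in Python objects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A hypothetical minor block that records min/max of its values."""
    values: List[int]

    def __post_init__(self):
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_greater_than(blocks, threshold):
    """Return values > threshold, skipping blocks whose max rules them out."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block.max_value <= threshold:
            continue  # predicate pushed down: block is skipped without being read
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read


blocks = [Block([1, 5, 9]), Block([10, 14, 18]), Block([20, 25, 30])]
matches, blocks_read = scan_greater_than(blocks, 15)
```

Only two of the three blocks are read in this example. The benefit depends on the data being clustered on the filtered column; with randomly ordered values every block's range would overlap the predicate.
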
  12. 12. Late materialization: Construct Row; Apply Filter + Projection; Project only the columns needed (also PPD); Decoding Column First; Wait until process; Different compression schemes behave differently
  13. 13. Early Materialization (Picture from William McKnight)
  14. 14. Late Materialization (Picture from William McKnight)
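
The contrast between the early and late materialization slides can be sketched in a few lines. The function names, the column names, and the sample data below are assumptions for illustration, not the speaker's implementation.

```python
def early_materialization(columns, predicate_column, predicate, projection):
    """Build every full row first, then filter and project."""
    names = list(columns)
    rows = [dict(zip(names, row)) for row in zip(*(columns[n] for n in names))]
    return [tuple(row[c] for c in projection)
            for row in rows if predicate(row[predicate_column])]


def late_materialization(columns, predicate_column, predicate, projection):
    """Filter one column into a position list, then fetch only needed columns."""
    positions = [i for i, v in enumerate(columns[predicate_column]) if predicate(v)]
    return [tuple(columns[c][i] for c in projection) for i in positions]


# Hypothetical table stored column by column.
columns = {
    "user_id": [1, 2, 3, 4],
    "city": ["Beijing", "Shanghai", "Beijing", "Hangzhou"],
    "price": [10, 25, 40, 5],
    "comment": ["a", "b", "c", "d"],
}
args = (columns, "price", lambda p: p > 20, ("user_id", "city"))
early = early_materialization(*args)
late = late_materialization(*args)
```

Both functions return the same rows, but the late variant never touches the `comment` column and builds tuples only for the two matching positions, which is the memory and CPU saving the slides refer to.
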
  15. 15. Common Confusion: IO. The more columns chosen, the closer to a row store; IO <5%; record-ID; Row store: free space at block tail, variable length field; IO Access Pattern means scalability; Hardware Trend; Compression rate
  16. 16. Common Confusion: SerDe. Row or PAX SerDe: CPU cache miss, no columnar compression; Block Iteration (construct tuple or row); Java vs C/C++: C/C++ direct memory mapping, Java Fastutil
  17. 17. Index and MV: Reduce IO; Scalability; Avoid Sort; Storage cost; Index join; Complex design; Lookup; Hard to maintain; Pre-computation: High latency; Join; Slow down loading; Group by; Lost Details; Query Rewrite
  18. 18. Data Modeling: Fat table vs 3NF
  19. 19. Hadoop Related: File Format. Trevni vs IBM CIF; Schema Evolution; Portable File Format; Bigger Block Size; IO Pattern; SerDe; network influence
  20. 20. Hadoop Related: Storage Cost. NameNode: Fewer blocks; Bigger block size; Cold data even bigger; No Intermediate Level. JobTracker: Each job has fewer map and reduce tasks. DataNode
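
The link between a bigger block size and fewer blocks for the NameNode to track is simple arithmetic, sketched below. The file sizes and the two block sizes are hypothetical values chosen for illustration; they are not figures from the talk.

```python
import math


def block_count(file_sizes_bytes, block_size_bytes):
    """Total number of blocks needed to store the given files."""
    return sum(math.ceil(size / block_size_bytes) for size in file_sizes_bytes)


MB = 1024 * 1024
files = [1024 * MB] * 100  # hypothetical data set: 100 files of 1 GiB each

blocks_at_64mb = block_count(files, 64 * MB)
blocks_at_256mb = block_count(files, 256 * MB)
```

With these assumed sizes, quadrupling the block size cuts the block count to a quarter. Since a job typically gets about one map task per block-sized split, the same change also lowers the number of map tasks the JobTracker has to schedule.
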
  21. 21. Hadoop Related: Real Data ingestion. HBase + Flume; Balanced Data; Write Avro file format first, then sort merge; SerDe memory reduce; Tuple Structure not row; Batch Update+Delete+Insert
  22. 22. Hadoop Related: MR Performance Boost. Block Shuffle (3 times faster); Skewed data has less overhead; Fewer maps and bigger spill; Reduce side combine; Light Compression Codec (Snappy not LZO); Combiner or in-memory combiner deprecated
  23. 23. Hadoop Related: Easier Performance Tuning. mapred.min.split.size (deprecated); io.sort.mb; io.sort.spill.percent (deprecated); io.sort.factor; mapred.reduce.parallel.copies (deprecated); Map and reduce numbers are easier to estimate; Reduce algorithm will change
  24. 24. Hadoop Related: Easy Management. Fewer Partitions or Dynamic Partition; Integrity constraints and Referential integrity; Statistics make a simple query engine; Cold Data automatic merge; Trojan Layout vs Columnar Projections. Less Design complexity: Map join vs Fat Table; Group by + Index
  25. 25. Reference ● DREMEL: Melnik, Gubarev, Long, Romer, Shivakumar, & Tolton, VLDB 2010 ● Trevni
  26. 26. Thank you! Q&A. Alex Jiang, gemini5201314 at gmail dot com