
Column and Hadoop



My planned talk at HBTC, China's largest big data technology conference, on columnar databases and related Hadoop topics.




  • 1. Columnar Database and Hadoop. 江志伟 (Alex Jiang), 2012-12-1
  • 2. Agenda
      1. Column Advantage
      2. Storage and Process
      3. Hadoop Related
  • 3. History
      ● 2001: PAX
      ● Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, … C-Store: A Column Oriented DBMS
      ● D. J. Abadi et al.: Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, pages 671–682, 2006.
      ● D. J. Abadi et al.: Materialization Strategies in a Column-Oriented DBMS. In ICDE, pages 466–475, 2007.
  • 4. File Format
      ● PAX
      ● Columnar storage
      ● (Columnar) compression
      ● PPD vs Index or MV
      ● SerDe
  • 5. PAX (picture from Oracle blog)
  • 6. Columnar Store vs Row Store
      ● IO-1 (basic column store): Every storage block contains data from only ONE column.
      ● IO-2: Aggressive compression.
      ● IO-3: No record-ids.
      ● CPU-4: A column executor.
      ● CPU-5: Executor runs on compressed data.
      ● CPU-6: Executor can process columns that are key sequence or entry sequence.
  • 7. Columnar Store Advantage
      ● Compression: RLE, bitmap, ...
      ● PPD: reduces IO
      ● Late materialization: less memory and CPU overhead
      ● Block iteration (vectorization): less CPU overhead
      ● Invisible join: block as join key
  • 8. Compression
      ● Encodings: Run-length Encoding, ENCODING DELTAVAL, Bit Vector Encoding, BLOCK_DICT
      ● High selectivity: gender, age
      ● Mid selectivity: city, category (data skew)
      ● Low selectivity: compound item_id, user_id, price, quantity, comment
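Slide 8 names run-length and bit vector encoding without showing them. A minimal sketch of both follows, assuming a column is just a Python list; the function names and the sample `gender` column are illustrative and do not come from the talk or from any particular database.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def bit_vector_encode(values):
    """Build one 0/1 vector per distinct value, marking the rows where it occurs."""
    return {
        distinct: [1 if v == distinct else 0 for v in values]
        for distinct in sorted(set(values))
    }


# Hypothetical sorted column with few distinct values, so runs are long.
gender = ["F", "F", "F", "M", "M"]
encoded = rle_encode(gender)
bitmaps = bit_vector_encode(gender)
```

Run-length encoding pays off only when equal values sit next to each other, which is why it is usually paired with a sorted column; the bit vector form grows with the number of distinct values.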
  • 9. Column File Format (picture from Vertica blog)
  • 10. PPD
      ● Predicate Push Down
      ● Continuous IO
      ● Compound predicate
      ● Max-min in each minor block
      ● PAX has PPD, but it is not efficient
  • 11. PPD (picture from Vertica blog)
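The "max-min in each minor block" idea from slide 10 can be sketched as a range scan that skips any block whose stored minimum and maximum rule out a match. This is an illustration under assumptions: `MinorBlock` and `scan_range` are hypothetical names, and a real column file would keep these statistics in block metadata on disk rather than in Python objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MinorBlock:
    """A chunk of one column plus the min/max statistics kept alongside it."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self):
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_range(blocks, low, high):
    """Return (matching values, number of blocks read) for low <= v <= high."""
    matches, blocks_read = [], 0
    for block in blocks:
        # Predicate push down: the statistics alone prove no value can match.
        if block.max_value < low or block.min_value > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read


# Hypothetical column split into three minor blocks.
blocks = [MinorBlock([1, 3, 5]), MinorBlock([10, 12, 19]), MinorBlock([20, 25, 31])]
matches, blocks_read = scan_range(blocks, 11, 21)
```

Here the first block is never read, because its maximum is below the lower bound. Skipping works best when the column is sorted or clustered, so that block ranges overlap little.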
  • 12. Late Materialization
      ● Construct row
      ● Apply filter + projection
      ● Project only the columns needed (also PPD)
      ● Decode column first
      ● Wait until processing
      ● Different compression schemes have different behavior
  • 13. Early Materialization (Picture from William McKnight)
  • 14. Late Materialization (Picture from William McKnight)
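The contrast between slides 13 and 14 can be sketched on a toy in-memory table. The table contents, column names, and both function names are hypothetical; the sketch only shows the order of operations, not the memory layout of a real engine.

```python
# Hypothetical column-store table: one Python list per column.
columns = {
    "user_id": [1, 2, 3, 4],
    "city": ["Beijing", "Shanghai", "Beijing", "Hangzhou"],
    "price": [10, 25, 40, 5],
}


def early_materialization(columns, min_price):
    """Build every full row first, then filter and project."""
    names = list(columns)
    rows = [dict(zip(names, vals)) for vals in zip(*columns.values())]
    return [(r["user_id"], r["city"]) for r in rows if r["price"] >= min_price]


def late_materialization(columns, min_price):
    """Filter on the price column alone, then fetch only the needed values."""
    positions = [i for i, p in enumerate(columns["price"]) if p >= min_price]
    return [(columns["user_id"][i], columns["city"][i]) for i in positions]


early = early_materialization(columns, 20)
late = late_materialization(columns, 20)
```

Both functions return the same result, but the late version builds output tuples only for the positions that pass the filter and never touches columns outside the projection.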
  • 15. Common Confusion: IO
      ● The more columns chosen, the closer to a row store
      ● IO <5%
      ● Record-ID
      ● Row store: free space at block tail, variable-length fields
      ● IO access pattern means scalability
      ● Hardware trend
      ● Compression rate
  • 16. Common Confusion: SerDe
      ● Row or PAX SerDe: CPU cache misses, no columnar compression
      ● Block iteration (construct tuple or row)
      ● Java vs C/C++: C/C++ has direct memory mapping; Java has fastutil
  • 17. Index and MV
      ● Benefits: reduce IO, avoid sort, index join, lookup, pre-computation (join, group by), query rewrite
      ● Costs: scalability, storage cost, complex design, hard to maintain, high latency, slower loading, lost details
  • 18. Data Modeling: fat table vs 3NF
  • 19. Hadoop Related: File Format
      ● Trevni vs IBM CIF
      ● Schema evolution
      ● Portable file format
      ● Bigger block size
      ● IO pattern
      ● SerDe
      ● Network influence
  • 20. Hadoop Related: Storage Cost
      ● NameNode: fewer blocks, bigger block size, cold data even bigger, no intermediate level
      ● JobTracker: each job has fewer map and reduce tasks
      ● DataNode
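The "fewer blocks, bigger block size" point on slide 20 is simple arithmetic, worked here for illustration. The 1 TiB file and the two block sizes are hypothetical numbers, not figures from the talk.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def block_count(file_bytes, block_bytes):
    """Number of fixed-size blocks needed to hold a file (last block may be partial)."""
    return (file_bytes + block_bytes - 1) // block_bytes


# Hypothetical 1 TiB file stored with two different block sizes.
small_blocks = block_count(1 * TIB, 64 * MIB)
large_blocks = block_count(1 * TIB, 256 * MIB)
```

Quadrupling the block size cuts the number of blocks the NameNode has to track for this file to a quarter, and with one map task per block it also reduces the number of map tasks per job.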
  • 21. Hadoop Related: Real Data Ingestion
      ● HBase + Flume
      ● Balanced data
      ● Write Avro file format first, then sort-merge
      ● SerDe memory reduction
      ● Tuple structure, not row
      ● Batch update + delete + insert
  • 22. Hadoop Related: MR Performance Boost
      ● Block shuffle (3 times faster)
      ● Skewed data has less overhead
      ● Fewer map tasks and bigger spills
      ● Reduce-side combine
      ● Light compression codec (Snappy, not LZO)
      ● Combiner or in-memory combiner deprecated
  • 23. Hadoop Related: Easier Performance Tuning
      ● mapred.min.split.size (deprecated)
      ● io.sort.mb
      ● io.sort.spill.percent (deprecated)
      ● io.sort.factor
      ● mapred.reduce.parallel.copies (deprecated)
      ● Map and reduce numbers are easier to estimate
      ● Reduce algorithm will change
  • 24. Hadoop Related: Easy Management
      ● Fewer partitions, or dynamic partitions
      ● Integrity constraints and referential integrity
      ● Statistics make the query engine simple
      ● Cold data automatic merge
      ● Trojan Layout vs columnar projections
      ● Less design complexity
      ● Map join vs fat table
      ● Group by + index
  • 25. Reference
      ● DREMEL: Melnik, Gubarev, Long, Romer, Shivakumar, & Tolton, VLDB 2010
      ● Trevni
  • 26. Thank you! Q&A
      Alex Jiang, gemini5201314 at gmail dot com