Column and hadoop

Slides for my planned talk at HBTC, China's largest big data technology conference, covering columnar databases and related areas of Hadoop.

Transcript

  • 1. Columnar Database and Hadoop. 江志伟 (Alex Jiang), 2012-12-1
  • 2. Agenda: 1. Column Advantage; 2. Storage and Process; 3. Hadoop Related
  • 3. History 2001 PAX Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, … C-Store: A Column Oriented DBMS D. J. Abadi, etc: Integrating Compression and Execution in Column-O riented Database Systems. In SIGMOD, pages 671–682, 2006. D. J. Abadi, etc: Materialization Strategies in a Column-Oriented DB MS. In ICDE, pages 466–475, 2007.
  • 4. File Format
      ● PAX
      ● Columnar storage
      ● (Columnar) compression
      ● PPD vs Index or MV
      ● SerDe
  • 5. PAX (picture from Oracle blog)
  • 6. Columnar Store vs Row Store
      ● IO-1 (basic column store): Every storage block contains data from only ONE column.
      ● IO-2: Aggressive compression.
      ● IO-3: No record-ids.
      ● CPU-4: A column executor.
      ● CPU-5: Executor runs on compressed data.
      ● CPU-6: Executor can process columns that are key sequence or entry sequence.
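Point CPU-5 on this slide (the executor runs on compressed data) can be illustrated with a small sketch. It assumes a column stored as run-length `(value, run_length)` pairs; the function names and the sample data are hypothetical and not taken from any particular engine. Aggregates are computed once per run instead of once per row, so the column is never decompressed.

```python
from typing import Callable, List, Tuple

Run = Tuple[int, int]  # (value, run_length); an assumed in-memory form of an RLE column


def rle_sum(runs: List[Run]) -> int:
    """SUM over an RLE column: one multiply-add per run, no decompression."""
    return sum(value * length for value, length in runs)


def rle_count_where(runs: List[Run], predicate: Callable[[int], bool]) -> int:
    """COUNT(*) WHERE predicate(value): the predicate is tested once per run."""
    return sum(length for value, length in runs if predicate(value))


# Hypothetical sorted "age" column: 18 x4, 25 x3, 40 x2 (9 rows held as 3 runs).
age_runs = [(18, 4), (25, 3), (40, 2)]
total_age = rle_sum(age_runs)
adults_25_plus = rle_count_where(age_runs, lambda age: age >= 25)
```

The work done is proportional to the number of runs, not the number of rows, which is why this style of execution pays off most on sorted or low-cardinality columns.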
  • 7. Columnar Store Advantages
      ● Compression: RLE, bitmap, ...
      ● PPD reduces IO
      ● Late materialization: less memory and CPU overhead
      ● Block iteration (vectorization): less CPU overhead
      ● Invisible join: block as join key
  • 8. Compression
      ● Encodings: Run-length Encoding; ENCODING DELTAVAL; Bit Vector Encoding; BLOCK_DICT
      ● High selectivity: Gender, age
      ● Mid selectivity: City, Category; data skew
      ● Low selectivity: compound item_id, user_id; Price, quantity, comment
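The encodings named on this slide can be sketched in a few lines each. This is a minimal illustration, not any product's on-disk format: the function names are hypothetical, the dictionary encoder only mirrors the idea behind a block dictionary, and `delta_from_min` assumes that a DELTAVAL-style encoding stores each value as its difference from the smallest value in the block.

```python
from itertools import groupby
from typing import Dict, Hashable, List, Tuple


def rle_encode(values: List[Hashable]) -> List[Tuple[Hashable, int]]:
    """Run-length encoding: collapse each run of equal values to (value, count)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def bitvector_encode(values: List[Hashable]) -> Dict[Hashable, int]:
    """Bit-vector encoding: one bitmap per distinct value; bit i is set if row i holds it."""
    vectors: Dict[Hashable, int] = {}
    for row, value in enumerate(values):
        vectors[value] = vectors.get(value, 0) | (1 << row)
    return vectors


def dict_encode(values: List[Hashable]) -> Tuple[Dict[Hashable, int], List[int]]:
    """Dictionary encoding: replace each value with a small integer code."""
    dictionary: Dict[Hashable, int] = {}
    codes = [dictionary.setdefault(value, len(dictionary)) for value in values]
    return dictionary, codes


def delta_from_min(values: List[int]) -> Tuple[int, List[int]]:
    """Delta encoding (assumed variant): store the block minimum plus small offsets."""
    base = min(values)  # requires a non-empty block
    return base, [value - base for value in values]


# Hypothetical sample columns.
gender_runs = rle_encode(["F", "F", "F", "M", "M", "F"])
city_dict, city_codes = dict_encode(["NY", "LA", "NY"])
price_base, price_offsets = delta_from_min([103, 100, 107])
```

Run-length and bit-vector encoding suit columns with few distinct values, while dictionary and delta encoding shrink wider values into small integers.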
  • 9. Column File Format (picture from Vertica blog)
  • 10. PPD (Predicate Push Down)
      ● Continuous IO
      ● Compound predicates
      ● Max-min in each minor block
      ● PAX has PPD, but it is not efficient
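The "max-min in each minor block" idea can be sketched as follows. The assumption here is that every block of a column keeps its minimum and maximum value as metadata, so a range predicate can skip blocks that cannot contain a match. `Block`, `make_blocks`, and `scan_range` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    values: List[int]
    min_value: int  # per-block statistics kept alongside the data
    max_value: int


def make_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max for each."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return values in [low, high] and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Pushed-down predicate: skip the block using only its min/max metadata.
        if block.max_value < low or block.min_value > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read


# Hypothetical sorted column of 100 values in 10 blocks.
blocks = make_blocks(list(range(100)), block_size=10)
matches, blocks_read = scan_range(blocks, 42, 47)
```

On sorted or clustered data most blocks are skipped; on randomly ordered data the min/max ranges overlap and the saving shrinks.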
  • 11. PPD (picture from Vertica blog)
  • 12. Late Materialization
      ● Construct row
      ● Apply filter + projection
      ● Project only the columns needed (also PPD)
      ● Decode column first
      ● Wait until processing
      ● Different compression schemes behave differently
  • 13. Early Materialization (Picture from William McKnight)
  • 14. Late Materialization (Picture from William McKnight)
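The difference between the two strategies on the preceding slides can be sketched on a toy table. The assumption is that a table is held as a dict of equal-length column lists; the function names and the sample data are hypothetical. Early materialization builds every row before filtering, while late materialization filters one column to a list of positions and only then fetches the projected columns at those positions.

```python
from typing import Any, Callable, Dict, List, Tuple

Columns = Dict[str, List[Any]]


def early_materialization(columns: Columns, filter_col: str,
                          predicate: Callable[[Any], bool],
                          project: List[str]) -> List[Tuple[Any, ...]]:
    """Construct every row first, then apply the filter and the projection."""
    names = list(columns)
    rows = [dict(zip(names, values)) for values in zip(*columns.values())]
    return [tuple(row[c] for c in project) for row in rows if predicate(row[filter_col])]


def late_materialization(columns: Columns, filter_col: str,
                         predicate: Callable[[Any], bool],
                         project: List[str]) -> List[Tuple[Any, ...]]:
    """Filter one column to row positions, then touch only the projected columns."""
    positions = [i for i, value in enumerate(columns[filter_col]) if predicate(value)]
    return [tuple(columns[c][i] for c in project) for i in positions]


# Hypothetical table with three columns.
table = {
    "city": ["NY", "LA", "NY", "SF"],
    "price": [10, 20, 30, 40],
    "quantity": [1, 2, 3, 4],
}
early = early_materialization(table, "city", lambda c: c == "NY", ["price"])
late = late_materialization(table, "city", lambda c: c == "NY", ["price"])
```

Both functions return the same result; the late variant never reads the `quantity` column and never builds rows that the filter would discard.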
  • 15. Common Confusion: IO
      ● The more columns a query selects, the closer the IO is to a row store
      ● IO <5%
      ● record-ID
      ● Row store: free space at block tail, variable-length fields
      ● IO access pattern means scalability
      ● Hardware trend
      ● Compression rate
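The first point on this slide is simple arithmetic, sketched below. The column names are borrowed from the compression slide, but the byte widths are invented for illustration, and the estimate deliberately ignores compression and record-id overhead. It assumes a row store must read whole rows while a column store reads only the selected columns.

```python
from typing import Dict, List


def column_store_io_fraction(column_widths: Dict[str, int], selected: List[str]) -> float:
    """Fraction of a full-row scan's bytes that a column store reads for a query."""
    total_bytes = sum(column_widths.values())
    read_bytes = sum(column_widths[name] for name in selected)
    return read_bytes / total_bytes


# Hypothetical average widths in bytes per value.
widths = {"item_id": 8, "user_id": 8, "price": 8, "quantity": 4, "comment": 100}

narrow_query = column_store_io_fraction(widths, ["price", "quantity"])  # 12 of 128 bytes
full_query = column_store_io_fraction(widths, list(widths))             # every column
```

With these made-up widths a two-column query reads under a tenth of the bytes, and selecting every column brings the fraction back to 1.0, which is the row-store case.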
  • 16. Common Confusion: SerDe
      ● Row or PAX SerDe: CPU cache misses, no columnar compression
      ● Block iteration (construct tuple or row)
      ● Java vs C/C++: C/C++ direct memory mapping; Java fastutil
  • 17. Index and MV
      ● Reduce IO; avoid sort; index join; lookup; pre-computation: join, group by; query rewrite
      ● Scalability; storage cost; complex design; hard to maintain; high latency; slows down loading; lost details
  • 18. Data Modeling: fat table vs 3NF
  • 19. Hadoop Related: File Format
      ● Trevni vs IBM CIF
      ● Schema evolution
      ● Portable file format
      ● Bigger block size
      ● IO pattern
      ● SerDe network influence
  • 20. Hadoop Related: Storage Cost
      ● NameNode: fewer blocks; bigger block size; cold data even bigger; no intermediate level
      ● JobTracker: each job has fewer map and reduce tasks
      ● DataNode
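The link between block size and NameNode load is a ceiling division, sketched here. The file size and block sizes are illustrative choices, not figures from the talk, and the map-task comment assumes one input split per block.

```python
def block_count(file_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed to store a file (integer ceiling division)."""
    return -(-file_bytes // block_bytes)


MIB = 1024 ** 2
TIB = 1024 ** 4

# Illustrative 1 TiB file at two block sizes.
blocks_at_64_mib = block_count(1 * TIB, 64 * MIB)    # 16384 blocks
blocks_at_256_mib = block_count(1 * TIB, 256 * MIB)  # 4096 blocks

# Assuming one input split per block, the map-task count of a full scan
# shrinks by the same factor as the block count.
reduction_factor = blocks_at_64_mib // blocks_at_256_mib
```

Each block is metadata the NameNode keeps in memory, so quadrupling the block size in this example cuts that metadata, and the number of map tasks per full scan, by a factor of four.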
  • 21. Hadoop Related: Real Data Ingestion
      ● HBase + Flume
      ● Balanced data
      ● Write Avro file format first, then sort merge
      ● SerDe memory reduction: tuple structure, not row
      ● Batch Update + Delete + Insert
  • 22. Hadoop Related: MR Performance Boost
      ● Block shuffle (3 times faster)
      ● Skewed data has less overhead
      ● Fewer map tasks and bigger spills
      ● Reduce-side combine
      ● Light compression codec (Snappy, not LZO)
      ● Combiner or in-memory combiner deprecated
  • 23. Hadoop Related: Easier Performance Tuning
      ● mapred.min.split.size (deprecated)
      ● mapred.child.java.opts
      ● mapred.compress.map.output (deprecated)
      ● io.sort.mb
      ● io.sort.spill.percent (deprecated)
      ● io.sort.factor
      ● mapred.reduce.parallel.copies (deprecated)
      ● Map and reduce numbers are easier to estimate
      ● Reduce algorithm will change
  • 24. Hadoop Related: Easy Management
      ● Fewer partitions or dynamic partitions
      ● Integrity constraints and referential integrity
      ● Statistics make for a simple query engine
      ● Cold data automatic merge
      ● Trojan Layout vs columnar projections
      ● Less design complexity: map join vs fat table; group by + index
  • 25. References
      ● http://www.dbms2.com/2011/02/06/columnar-compression-database-storage/
      ● http://cs-www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
      ● http://www.infoq.com/news/2011/09/nosqlnow-columnar-databases/
      ● Dremel: Melnik, Gubarev, Long, Romer, Shivakumar, & Tolton, VLDB 2010
      ● Trevni: http://avro.apache.org/docs/current/trevni/spec.html
      ● http://www.vertica.com/2011/09/01/the-power-of-projections-part-1/
  • 26. Thank you! Q&A. Alex Jiang, gemini5201314 at gmail dot com, http://www.gemini5201314.net
