Column and hadoop

My planned talk at HBTC, China's largest big data technology conference, covering columnar databases and Hadoop-related topics.

Column and hadoop: Presentation Transcript

  • Columnar Database and Hadoop
    ● 江志伟 (Alex Jiang)
    ● 2012-12-1
  • Agenda
    ● 1. Column Advantage
    ● 2. Storage and Process
    ● 3. Hadoop Related
  • History
    ● 2001: PAX
    ● Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, …: C-Store: A Column Oriented DBMS
    ● D. J. Abadi et al.: Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, pages 671–682, 2006.
    ● D. J. Abadi et al.: Materialization Strategies in a Column-Oriented DBMS. In ICDE, pages 466–475, 2007.
  • File Format
    ● PAX
    ● Columnar storage
    ● (Columnar) compression
    ● PPD vs Index or MV
    ● SerDe
  • PAX (Picture from Oracle blog)
  • Columnar Store vs Row Store
    ● IO-1 (basic column store): Every storage block contains data from only ONE column.
    ● IO-2: Aggressive compression.
    ● IO-3: No record-ids.
    ● CPU-4: A column executor.
    ● CPU-5: Executor runs on compressed data.
    ● CPU-6: Executor can process columns that are key sequence or entry sequence.
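
The IO-1 and IO-3 points can be illustrated with a minimal sketch. The table, its column names, and the helper functions below are hypothetical and exist only for illustration; a real column store works on disk blocks, not Python lists.

```python
# Hypothetical three-column table, used only for illustration.
rows = [
    ("alice", 30, "Beijing"),
    ("bob", 25, "Shanghai"),
    ("carol", 35, "Beijing"),
]

# Row store: all fields of one record are kept together.
row_store = list(rows)

# Column store: each column is kept contiguously (IO-1). A value's position
# in the list stands in for an explicit record id (IO-3).
column_store = {
    "name": [r[0] for r in rows],
    "age": [r[1] for r in rows],
    "city": [r[2] for r in rows],
}


def avg_age_row_store(store):
    # Walks every full record just to reach the age field.
    return sum(record[1] for record in store) / len(store)


def avg_age_column_store(store):
    # Reads only the "age" column and ignores the others.
    ages = store["age"]
    return sum(ages) / len(ages)
```

Both functions return the same average; the difference is how much data each layout has to read to get there.
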
  • Columnar Store advantage
    ● Compression: RLE, Bitmap ...
    ● PPD: reduces IO
    ● Late Materialization: less memory and CPU overhead
    ● Block Iteration (Vectorization): less CPU overhead
    ● Invisible Join: block as join key
  • Compression
    ● Run-length Encoding
    ● High Selectivity: ENCODING DELTAVAL; Gender, age
    ● Bit Vector Encoding
    ● Mid Selectivity: BLOCK_DICT; City, Category; data skew
    ● Low Selectivity: compound; item_id, user_id; Price, quantity, comment
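
Two of the encodings named on this slide, run-length encoding and bit vector encoding, can be sketched as follows. This is a minimal illustration under the assumption that a column is a plain Python list; the function names are hypothetical and do not correspond to any particular database's API.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


def bitvector_encode(values):
    """Build one 0/1 vector per distinct value, marking where it occurs."""
    vectors = {}
    for position, value in enumerate(values):
        vectors.setdefault(value, [0] * len(values))[position] = 1
    return vectors
```

Run-length encoding pays off when equal values sit next to each other, for example in a sorted column with few distinct values. Bit vectors stay small only while the number of distinct values is small.
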
  • Column File Format (Picture from Vertica Blog)
  • PPD
    ● Predicate Push Down
    ● Continuous IO
    ● Compound Predicate
    ● Max-Min in each minor Block
    ● PAX has PPD, but it is not efficient
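
The "Max-Min in each minor Block" idea can be sketched as below: each block carries its minimum and maximum, and a scan skips any block whose range cannot satisfy the predicate. The block layout and function names are assumptions made for this illustration, not a real file format.

```python
def build_blocks(values, block_size):
    """Split a column into blocks and record min/max metadata per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return blocks


def scan_greater_than(blocks, threshold):
    """Return values > threshold and the number of blocks actually read."""
    matches = []
    blocks_read = 0
    for block in blocks:
        if block["max"] <= threshold:
            # No value in this block can match, so its data is never read.
            continue
        blocks_read += 1
        matches.extend(v for v in block["values"] if v > threshold)
    return matches, blocks_read
```

The skipping works best when the column is sorted or clustered, because each block then covers a narrow value range.
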
  • PPD (Picture from Vertica Blog)
  • Late Materialization
    ● Construct Row
    ● Apply Filter + Projection
    ● Project only the needed columns (also PPD)
    ● Decoding Column First
    ● Wait until processing
    ● Different compression schemes behave differently
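
The contrast between early and late materialization can be sketched as follows. The column names and data are hypothetical; the point is only that the late variant filters on one column, keeps a list of positions, and touches the projected column alone.

```python
# Hypothetical column-oriented table, used only for illustration.
columns = {
    "city": ["Beijing", "Shanghai", "Beijing", "Shenzhen"],
    "price": [10, 20, 30, 40],
    "comment": ["a", "b", "c", "d"],
}


def early_materialization(cols, city):
    """Construct every full row first, then filter and project."""
    names = list(cols)
    rows = [dict(zip(names, values)) for values in zip(*cols.values())]
    return [row["price"] for row in rows if row["city"] == city]


def late_materialization(cols, city):
    """Filter on one column to get positions, then fetch only what is needed."""
    positions = [i for i, c in enumerate(cols["city"]) if c == city]
    return [cols["price"][i] for i in positions]
```

Both functions return the same prices, but the late variant never builds rows and never reads the "comment" column.
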
  • Early Materialization (Picture from William McKnight)
  • Late Materialization (Picture from William McKnight)
  • Common Confusion: IO
    ● Selecting more columns moves the access pattern closer to a row store
    ● IO <5%
    ● record-ID
    ● Row store free space at block tail
    ● variable length field
    ● IO access pattern means scalability
    ● Hardware trend
    ● Compression rate
  • Common Confusion: SerDe
    ● Row or PAX SerDe
      ● CPU cache miss
      ● no columnar compression
      ● Block Iteration (construct tuple or row)
    ● Java vs C/C++
      ● C/C++ direct memory mapping
      ● Java Fastutil
  • Index and MV
    ● Reduce IO
    ● Scalability
    ● Avoid sort
    ● Storage cost
    ● Index join
    ● Complex design
    ● Lookup
    ● Hard to maintain
    ● Pre-computation:
    ● High latency
    ● Join
    ● Slow down loading
    ● Group by
    ● Lost details
    ● Query rewrite
  • Data Modeling
    ● Fat table vs 3NF
  • Hadoop Related
    ● File Format
      ● Trevni vs IBM CIF
      ● Schema Evolution
      ● Portable File Format
      ● Bigger Block Size
      ● IO Pattern
      ● SerDe
      ● network influence
  • Hadoop Related
    ● Storage Cost
    ● NameNode
      ● Fewer blocks
      ● Bigger block size
      ● Cold data even bigger
      ● No Intermediate Level
    ● JobTracker
      ● Each job has fewer map and reduce tasks
    ● DataNode
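
The link between block size and block count on this slide is simple arithmetic, sketched below. The NameNode tracks metadata per block, so fewer blocks for the same data means less metadata to hold. The file size and block sizes used here are illustrative assumptions, not figures from the talk.

```python
MB = 1024 * 1024
GB = 1024 * MB


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to store a file, rounding up (integer math)."""
    return -(-file_size_bytes // block_size_bytes)


# Illustrative sizes only: the same 10 GB file at two block sizes.
small_blocks = block_count(10 * GB, 64 * MB)
large_blocks = block_count(10 * GB, 256 * MB)
```

Quadrupling the block size cuts the block count for this file to a quarter, and with it the number of map tasks when each task processes one block.
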
  • Hadoop Related
    ● Real Data ingestion
      ● HBase + Flume
      ● Balanced Data
      ● Write Avro file format first, then sort merge
    ● SerDe
      ● memory reduce
      ● Tuple structure, not row
    ● Batch Update+Delete+Insert
  • Hadoop Related
    ● MR Performance Boost
      ● Block Shuffle (3 times faster)
      ● Skewed data has less overhead
      ● Fewer map tasks and bigger spills
      ● Reduce-side combine
      ● Light compression codec (Snappy, not LZO)
      ● Combiner or in-memory combiner deprecated
  • Hadoop Related
    ● Easier Performance Tuning
      ● mapred.min.split.size (deprecated)
      ● mapred.child.java.opts
      ● mapred.compress.map.output (deprecated)
      ● io.sort.mb
      ● io.sort.spill.percent (deprecated)
      ● io.sort.factor
      ● mapred.reduce.parallel.copies (deprecated)
      ● Map and reduce counts are easier to estimate
      ● Reduce algorithm will change
  • Hadoop Related
    ● Easy Management
      ● Fewer partitions or Dynamic Partition
      ● Integrity constraints and referential integrity
      ● Statistics make a simple query engine
      ● Cold data automatic merge
      ● Trojan Layout vs Columnar Projections
    ● Less Design complexity
      ● Map join vs Fat Table
      ● Group by + Index
  • Reference
    ● http://www.dbms2.com/2011/02/06/columnar-compression-database-storage/
    ● http://cs-www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
    ● http://www.infoq.com/news/2011/09/nosqlnow-columnar-databases/
    ● DREMEL: Melnik, Gubarev, Long, Romer, Shivakumar, & Tolton, VLDB 2010
    ● Trevni: http://avro.apache.org/docs/current/trevni/spec.html
    ● http://www.vertica.com/2011/09/01/the-power-of-projections-part-1/
  • Thank you! Q&A
    ● Alex Jiang
    ● gemini5201314 at gmail dot com
    ● http://www.gemini5201314.net