
Data compression for Data Lakes

A short presentation about row-based compression in a Data Lake with Hadoop.

The first version of the source code is available at https://github.com/viadeo/viadeo-avro-utils.

The code uses Avro for metadata management, which makes the task easier, but the ideas can be ported to other formats.



  1. Data Management: Compression for Data Lakes. Jonathan WINANDY, primatice.com
  2. About Me • I am (Not Only) a Data Engineer
  3. Contents • Data Lake: what is a Data Lake? • Compute: what can we do with the gathered data? • Conclusion!
  4. Infrastructure > The Data Lake* [architecture diagram: backend apps (v1, v2), app caches, DB clusters, specific DBs, OLAP DB, jobs, monitoring and authentication sitting around HDFS / HADOOP] (*) you actually need more than just Hadoop to make a data lake.
  5. Data Lake • How do we move data between the different data stores? We use HDFS to store a copy of all the data, and we use HADOOP to create views on this data. [diagram: DB clusters, OLAP DB and specific DBs around HDFS / HADOOP]
  6. Database snapshots: offloading to HDFS • Offloading data is just making a copy of database data onto HDFS, as a directory with a couple of files (parts). From: sql://$schema/$table To: hdfs://staging/fromschema=$schema/fromtable=$table/importdatetime=$time (HDFS is ‘just’ a Distributed File System)
  7. Database snapshots • For relational databases, we can use Apache Sqoop to copy, for example, the table payments from the schema sales into hdfs://staging/sales/payments/2014-09-22T18_01_10, hdfs://staging/sales/payments/2014-09-21T17_50_25, hdfs://staging/sales/payments/2014-09-20T18_32_43 (a hedged example invocation is sketched just below). From: sql://$schema/$table To: hdfs://staging/fromschema=$schema/fromtable=$table/importdatetime=$time (HDFS is ‘just’ a Distributed File System)
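As an illustration, a daily offload of that payments table could look like the following Sqoop invocation; the JDBC connection string, credentials and mapper count are placeholders, only the target-directory convention comes from the slides:

    # hypothetical daily snapshot of sales.payments into the staging area
    sqoop import \
      --connect jdbc:mysql://db-host/sales \
      --username reporting \
      --table payments \
      --as-avrodatafile \
      --num-mappers 4 \
      --target-dir hdfs://staging/fromschema=sales/fromtable=payments/importdatetime=2014-09-22T18_01_10

Writing Avro files (--as-avrodatafile) keeps the schema next to the data, which is what makes the metadata handling in the merge step easier.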
  8. Data Lake • A light capacity planning for your potential cluster: • if you have 100 GB of binary compressed data*, • with a storage cost of around $0.03 per GB per month, • an offload every day would cost $90 per month of kept data, • in six months this offloading would cost $1900 and weigh 18 TB (<- this is Big Data); the arithmetic is worked out just below. * obviously not plain text/JSON
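The arithmetic on this slide can be checked in a few lines of Python; this is a minimal sketch assuming one 100 GB snapshot per day, 30-day months and $0.03/GB/month, as stated above:

    # back-of-the-envelope storage cost of keeping every daily snapshot
    snapshot_gb = 100          # one binary-compressed offload
    price_gb_month = 0.03      # $ per GB per month

    total_cost = 0.0
    for month in range(1, 7):                 # six months of history
        stored_gb = snapshot_gb * 30 * month  # everything kept so far
        total_cost += stored_gb * price_gb_month

    print(stored_gb / 1000.0, "TB kept after 6 months")  # -> 18.0 TB
    print(round(total_cost), "$ of cumulated storage")   # -> 1890, i.e. ~$1900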
  9. Data Lake • But for these 18 TB, you get quite a lot of features: • you can track and understand bugs and data corruption, • analyse the business without harming your production DBs, • bootstrap new products based on your existing data, • and you now have a real excuse to learn Hadoop, or to make some use of your existing Hadoop bare-metal clusters!
  10. Snapshot analysis (Compute) • Having a couple of snapshots in HDFS, we can use the tremendous power of MapReduce to join over the snapshots and compute what changed in the database during a day. • Extract the snapshots of the SQL tables onto HDFS, then compare them with the snapshots from the previous days: delta = f( J, J-1 ), merge = f( J, J-1, J-2, ... )
  11. Snapshot analysis (Compute) • The Δ function takes 2 (or more) snapshots and outputs a merged view of those snapshots with an extra column ‘dm’. Inputs: snapshot 1 = {(1, Jon), (2, Bob), (3, James)}, snapshot 2 = {(1, John), (3, James), (4, Jim)}. Output: (1, John, dm=01), (1, Jon, dm=10), (3, James, dm=11), (4, Jim, dm=01), (2, Bob, dm=10). https://github.com/viadeo/viadeo-avro-utils
  12. Snapshot analysis (Compute) • Generalised: the Δ function takes N snapshots and outputs a merged view with an extra ‘dm’ column. Inputs: {1, 2, 3}, {1, 3, 4}, {1, 4, 5}, {1, 3, 5}. Output: (1, dm=1111), (2, dm=1000), (3, dm=1101), (4, dm=0110), (5, dm=0011). A plain-Python sketch of this merge follows below. https://github.com/viadeo/viadeo-avro-utils
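The linked repository does this merge on Avro files with Hadoop; purely as an illustration of the ‘dm’ bitmask idea, a hypothetical in-memory version could look like this (the field names and tuple representation are assumptions, not the actual API):

    # Sketch of the Δ merge: identical rows collapse into one record,
    # and 'dm' marks in which snapshots that exact row was present.
    def delta(*snapshots):
        """Each snapshot is a list of (id, name) tuples."""
        presence = {}                              # row -> list of 0/1 flags
        for position, snapshot in enumerate(snapshots):
            for row in snapshot:
                mask = presence.setdefault(row, [0] * len(snapshots))
                mask[position] = 1
        return [(id_, name, "".join(map(str, mask)))
                for (id_, name), mask in sorted(presence.items())]

    day_1 = [(1, "Jon"), (2, "Bob"), (3, "James")]
    day_2 = [(1, "John"), (3, "James"), (4, "Jim")]
    for row in delta(day_1, day_2):
        print(row)  # (1, 'John', '01'), (1, 'Jon', '10'), (2, 'Bob', '10'), ...

Rows that did not change between snapshots are stored only once plus a small bitmask, which is where the compression measured on the next slides comes from.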
  13. Snapshot efficiency (Compute) • Analysing the compression: merging 50 snapshots. Ratio: size of the merged output / size of the largest input. Less is better, ≤ 1 is perfect.
  14. Snapshot efficiency (Compute) • Analysing the compression: merging 50 snapshots. [chart annotations: small datasets get a bit of overhead; OK, just x2; WINS!!] Ratio: size of the merged output / size of the largest input. Less is better, ≤ 1 is perfect.
  15. Conclusion • So now the offloading might cost just $36 of storage for 6 months ($6 per month). • Welcome back to small data! • But what does this mean for production data? Did you really need to store it in a mutable database in the first place?
  16. Thank you, questions?
