BUILDING MODERN DATA LAKES
Minio, Spark and Unified Data Architecture work in unison
By Ravi Shankar, October 2018
10/29/18
1
FIRST ORDER LOGIC
First-order logic, also known as first-order predicate calculus or predicate logic, is a collection
of formal systems used in mathematics, philosophy, linguistics, and computer science.
Married("Harry", "Sally", "12-Dec-1995").
IsMotherOf("Sally", "Peter").
IsFatherOf("Harry", "Peter").
The relational model says that this is how you think about and
represent all the data in your database.
Example query: there exists one or more X such that the
marriage happened in 1995.
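The facts above can be sketched as relations in a few lines (illustrative Python, not from the deck): each predicate becomes a relation, i.e. a set of tuples, and the existential query is a scan over one relation.

```python
# Each first-order predicate becomes a relation: a set of tuples.
married = {("Harry", "Sally", "12-Dec-1995")}
is_mother_of = {("Sally", "Peter")}
is_father_of = {("Harry", "Peter")}

def exists_marriage_in(year, relation):
    """There exists one or more X such that the marriage happened in `year`."""
    return any(date.endswith(str(year)) for (_, _, date) in relation)

print(exists_marriage_in(1995, married))  # True
```

This is exactly the relational model's claim: store facts as tuples, answer questions as queries over those tuples.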
THE DATA MODELS.
• A subject-oriented, integrated, time-variant and non-volatile collection of data
• Integrating data marts into a dimensional model for consumption
PROBLEM STATEMENT.
New digitalization initiatives arrive on top of the earlier stack!!! The options:
1. Change everything
2. Keep as is; add new relations
3. Move to:
CTO GETS HADOOP IN.
1. Scale-out architecture
2. Shared nothing
3. Compute + storage together
4. Google-like!!
ALL WENT SMOOTH UNTIL...
A zip file sent from a third-party vendor contained one million JPEG files. We wrote a MapReduce
program to process it.
The file is 8 GB, split into 128 MB blocks: about 64 blocks; with 3x replication,
about 24 GB of raw storage.
We executed the application. What might have happened?
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
SPLITTABILITY IMPORTANCE.
So, performance is not guaranteed in all scenarios with existing distributed technologies
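A quick sketch of the numbers behind this slide (illustrative arithmetic; the block size and replication factor are the slide's assumptions): a splittable format yields one input split per block, but a zip archive is not splittable, so the whole 8 GB file becomes a single split processed by one mapper.

```python
import math

file_size_mb = 8 * 1024          # the vendor's 8 GB zip
block_mb = 128                   # HDFS block size
blocks = math.ceil(file_size_mb / block_mb)
raw_gb = 8 * 3                   # three replicas of the file

print(blocks)   # 64 blocks on disk
print(raw_gb)   # 24 GB of raw storage

# Splittable input (e.g. plain text): up to one mapper per block.
splits_text = blocks
# Zip is NOT splittable: one input split, one mapper for a million JPEGs.
splits_zip = 1
```

The cluster stores 64 blocks but the job gets no parallelism at all, which is why splittability matters.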
THE SUBSEQUENT MONTHS.
1. We copied data from Netezza to HIVE
2. We created reports from Tableau with HIVE ODBC
3. We created a copy of HIVE into HBASE
4. We have HDP, but Cloudera supports Impala
5. MapReduce is slow
6. All data is not at one place
7. Maybe some more tools are needed
8. We need a unified data architecture solution
9. Rebalancing took an entire week
10. Important file types are not splittable
11. 3 copies is too much space
12. Cost of maintenance is high
13. We may need to go to cloud
14. SLA not met
15. Too much operational work
WHAT MAKES ORGANIZATION FAMOUS?
CTO wants AI, but AI is different from IA!!
AI: autonomous systems that REPLACE the human cognitive thought process
IA: autonomous systems that SUPPORT the human cognitive thought process
[Diagram: traditional programming takes Input + Algorithm and produces Output; machine learning takes Input + Output and learns the Algorithm]
Both need machine learning and deep learning; these are the means to do AI or IA.
OUTPUT MAY BE NEEDED INSTANTLY, BUT LEARNING IT
MAY TAKE HOURS/DAYS/MONTHS
[Diagram: neural network with inputs, "hidden" layer(s), and an output layer]
FILE SYSTEMS.
• The problem is the file system. Traditional block-based file systems use
lookup tables to store file locations. They break each file up into small
blocks, generally 4k in size, and store the byte offset of each block in a
large table.
• This is fine for small volumes, but when you attempt to scale to the
petabyte range, these lookup tables become extremely large. It’s like
a database. The more rows you insert, the slower your queries
run. Eventually your performance degrades to the point where your
file system becomes unusable.
• When this happens, users are forced to split their data sets up into
multiple LUNs to maintain an acceptable level of performance. This
adds complexity and makes these systems difficult to manage.
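The scaling pressure on those lookup tables is easy to see with rough arithmetic (illustrative only, assuming the 4 KB block size mentioned above and one table entry per block):

```python
KB, TB, PB = 1024, 1024**4, 1024**5
block_size = 4 * KB

# One lookup-table entry per 4 KB block:
entries_1tb = TB // block_size   # ~268 million entries for 1 TB
entries_1pb = PB // block_size   # ~275 billion entries for 1 PB

print(f"{entries_1tb:,} entries per TB")
print(f"{entries_1pb:,} entries per PB")
```

A table with hundreds of billions of rows is exactly the "more rows, slower queries" situation the slide describes.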
SCALE-OUT FILE SYSTEMS.
• To solve this problem, some organizations are deploying scale-out file
systems, like HDFS. This fixes the scalability problem, but keeping these
systems up and running is a labor-intensive process.
• Scale-out file systems are complex and require constant
maintenance. In addition, most of them rely on replication to protect
your data. The standard configuration is triple-replication, where you
store 3 copies of every file.
• This requires an extra 200% of raw disk capacity for
overhead! Everyone thinks that they’re saving money by using
commodity drives, but by the time you store three full copies of your
data set, the cost savings disappears. When we’re talking about
petabyte-scale applications, this is an expensive approach.
SOLUTION TO STORAGE.
• Object stores achieve their scalability by decoupling file management
from the low-level block management. Each disk is formatted with a
standard local file system, like ext4. Then a set of object storage
services is layered on top of it, combining everything into a single,
unified volume.
• Files are stored as “objects” in the object store rather than files on a
file system. By offloading the low-level block management onto the
local file systems, the object store only has to keep track of the high-
level details.
• This layer of separation keeps the file lookup tables at a manageable
size, allowing you to scale to hundreds of petabytes without
experiencing degraded performance.
SOLUTION TO STORAGE.
• To maximize usable space, object stores use a technique called
Erasure Coding to protect your data. You can think of it as the next
generation of RAID.
• In an erasure coded volume, files are divided into shards, with each
shard being placed on a different disk. Additional shards are added,
containing error correction information, which provide protection from
data corruption and disk failures. Only a subset of the shards is
required to retrieve each file, which means it can survive multiple disk
failures without the risk of data loss.
• Erasure coded volumes can survive more disk failures than RAID and
typically provide more than double the usable capacity of triple
replication, making them the ideal choice for petabyte-scale storage.
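A sketch of the capacity claim (the 12-data/4-parity layout is an assumed example for illustration, not a mandated MinIO configuration):

```python
def usable_fraction_ec(data_shards, parity_shards):
    """Fraction of raw disk that holds real data under erasure coding."""
    return data_shards / (data_shards + parity_shards)

def usable_fraction_replication(copies):
    """Fraction of raw disk that holds real data under n-way replication."""
    return 1 / copies

ec = usable_fraction_ec(12, 4)           # 12 data + 4 parity shards
rep3 = usable_fraction_replication(3)    # triple replication

print(ec)    # 0.75 of raw capacity is usable, survives 4 disk failures
print(rep3)  # ~0.33 usable: the EC layout gives 2.25x the usable space
```

With 4 parity shards this layout also tolerates more simultaneous disk failures than RAID 6's two, which is the other half of the slide's claim.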
MINIO - ERASURE CODING.
• EC is based on a technology called Forward Error Correction
(FEC), developed more than 50 years ago (Richard Hamming, 1940s).
It was originally used for controlling errors in data transmission
over noisy or unreliable telecommunication channels.
Reed-Solomon codes are a kind of EC, used widely in
CDs/DVDs, Blu-ray, satellite communications, etc.
• A message of k symbols can be transformed into a longer
message (a code word carrying parity) of n symbols such that
the original message can be recovered from a subset of
the n symbols. If n = k + 1, there is a special case
called a parity check.
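The n = k + 1 parity-check special case can be demonstrated in a few lines (a toy sketch with XOR over byte symbols; real erasure codes such as Reed-Solomon generalize this to survive many lost symbols):

```python
from functools import reduce

def add_parity(data):
    """Append one parity symbol: the XOR of the k data symbols (n = k + 1)."""
    return data + [reduce(lambda a, b: a ^ b, data)]

def recover(symbols, lost_index):
    """Rebuild the one lost symbol by XOR-ing all the surviving ones."""
    others = [s for i, s in enumerate(symbols) if i != lost_index]
    return reduce(lambda a, b: a ^ b, others)

data = [0x41, 0x42, 0x43]          # k = 3 data symbols
coded = add_parity(data)           # n = 4 symbols, one per "disk"
assert recover(coded, 1) == 0x42   # disk 1 died? XOR the rest to get it back
```

Losing any one of the n symbols is survivable; losing two is not, which is why production systems use more parity shards than this.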
TOP 5 : COST.
• https://amzn.to/2Q7AWGo
• S3: 23 USD per TB per month (12.5 USD per TB for cold access)
• HDFS: Using d2.8xl instance types ($5.52/hr with 71% discount, 48TB
HDD), it costs 5.52 x 0.29 x 24 x 30 / 48 x 3 / 0.7 = $103/month for 1TB of
data. (Note that with reserved instances, it is possible to achieve lower
price on the d2 family.)
• S3 is about 5X cheaper than HDFS.
• S3’s human cost is virtually zero, whereas it usually takes a team of
Hadoop engineers or vendor support to maintain HDFS. Once we
factor in human cost, S3 is 10X cheaper than HDFS clusters on EC2 with
comparable capacity.
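The HDFS figure can be reproduced directly from the formula above (the prices and discount are the slide's 2018 assumptions, not current AWS pricing):

```python
hourly = 5.52          # d2.8xl on-demand price, $/hr
discount = 0.29        # fraction paid after the ~71% discount
hours = 24 * 30        # one month
capacity_tb = 48       # HDD capacity per instance
replication = 3        # HDFS triple replication
utilization = 0.7      # usable fraction of the cluster

hdfs_per_tb = hourly * discount * hours / capacity_tb * replication / utilization
s3_per_tb = 23.0

print(round(hdfs_per_tb))                  # ~103 USD per TB per month
print(round(hdfs_per_tb / s3_per_tb, 1))   # ~4.5x, i.e. "about 5X cheaper"
```

The replication and utilization terms are what make raw commodity disks expensive: each stored TB consumes 3 / 0.7 ≈ 4.3 TB of billed capacity.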
TOP 5 : ELASTICITY.
• From Databricks:
• 99.999999999% durability and 99.99% availability. Note that this is
higher than the vast majority of organizations’ in-house services.
• The majority of Hadoop clusters have availability lower than 99.9%, i.e. at
least 9 hours of downtime per year.
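The downtime figure follows from simple arithmetic over a 365-day year:

```python
def downtime_hours_per_year(availability):
    """Hours per year a service at the given availability may be down."""
    return (1 - availability) * 365 * 24

print(round(downtime_hours_per_year(0.999), 2))    # 8.76 h/year at 99.9%
print(round(downtime_hours_per_year(0.9999), 2))   # 0.88 h/year at S3's 99.99%
```

One extra nine cuts the allowed downtime tenfold, which is the gap the slide is pointing at.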
• With cross-AZ replication that automatically replicates across different
data centers, S3's availability and durability are far superior to HDFS's.
• Hortonworks – Data Plane Services in 2019!
TOP 5 : PERFORMANCE.
• When using HDFS and getting perfect data locality, it is possible to get
~3 GB/s per node of local read throughput on some of the instance types (e.g.
i2.8xl, roughly 90 MB/s per core). Spark DBIO, a cloud I/O optimization
module, provides optimized connectors to S3 and can sustain
~600 MB/s read throughput on i2.8xl (roughly 20 MB/s per core).
• That is to say, on a per-node basis, HDFS can yield 6X higher read
throughput than S3. Given that S3 is 10X cheaper than HDFS,
we find that S3 is almost 2X better than HDFS on performance
per dollar.
TOP 5 :TRANSACTIONS.
• hadoop fs -mkdir -p sample/a/b/c/
• Now you put the file into a/b/c
• Buckets... not directories
• In a Minio server instance, a single RESTful PUT request will create an
object "a/b/c/data.txt" in "mybucket" without having to create
"a/b/c" in advance.
• This works because an object store's namespace is flat: "/" is just part
of the object key, so hierarchical-looking names and operations need no
directories.
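A dict-backed toy (not the MinIO API; a real client would issue one PUT to a MinIO/S3 endpoint) shows why no mkdir is ever needed:

```python
bucket = {}  # the bucket is a flat map: full object key -> bytes

def put_object(bucket, key, body):
    """One operation; no parent 'directories' are created or required."""
    bucket[key] = body

put_object(bucket, "a/b/c/data.txt", b"hello")

# Prefix listing is what makes the flat keys LOOK hierarchical.
print([k for k in bucket if k.startswith("a/b/")])  # ['a/b/c/data.txt']
```

The "/" characters never triggered any directory creation; they are only matched later, at listing time, as an ordinary string prefix.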
TOP 5 :TRANSACTIONS.
• Data movement is the interesting part...
• What happens if a Spark write (saveAsTextFile) fails for a partition?
• In HDFS, rename is atomic: the most critical part of the Hadoop write flow.
• Minio (or any object store) does not provide an atomic rename. In
fact, rename should be avoided in object storage altogether, since it
consists of two separate operations: copy and delete.
• A normal COPY is mapped to a RESTful PUT or RESTful COPY
request and triggers internal data movement between storage
nodes. The subsequent delete maps to the RESTful DELETE
request, but usually relies on a bucket listing operation to identify
which data must be deleted. This makes rename highly inefficient in
object stores, and the lack of atomicity may leave data in a
corrupted state.
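A toy sketch of that failure window (dict-backed store, hypothetical object names; real clients would issue the COPY and DELETE requests over HTTP):

```python
def rename(bucket, src, dst, crash_after_copy=False):
    """Object-store 'rename': two separate operations, no atomicity."""
    bucket[dst] = bucket[src]    # step 1: RESTful COPY (data actually moves)
    if crash_after_copy:
        return                   # crash here => src AND dst both exist
    del bucket[src]              # step 2: RESTful DELETE

bucket = {"tmp/part-0000": b"rows"}
rename(bucket, "tmp/part-0000", "out/part-0000", crash_after_copy=True)
print(sorted(bucket))  # both keys survive: the corrupted state the slide warns about
```

A reader listing the bucket between the two steps (or after the crash) sees the output twice, which an HDFS rename can never expose.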
TOP 5 :TRANSACTIONS: PERFORMANCE.
• Hadoop's FileOutputCommitter has two algorithms: version 1 moves staged
task output files to their final locations at the end of the job; version 2
moves files as individual tasks complete.
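A toy simulation (hypothetical part-file names, not the real committer code) of what a mid-job task failure leaves behind under each version:

```python
def run_job(version, tasks, failing_task):
    """Simulate a job where `failing_task` fails; return the published files."""
    staged, final = {}, {}
    for t in tasks:
        if t == failing_task:
            return final                 # job aborts at this task
        staged[t] = f"part-{t}"
        if version == 2:                 # v2: publish as each task commits
            final[t] = staged.pop(t)
    if version == 1:                     # v1: publish only at job end
        final.update(staged)
    return final

print(run_job(1, [0, 1, 2], failing_task=2))  # {}  -- all or nothing
print(run_job(2, [0, 1, 2], failing_task=2))  # partial output already visible
```

Version 2 is faster (no big rename storm at job end) but trades away the all-or-nothing property, which is the performance-vs-integrity tension these slides explore.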
TOP 5 : DATA INTEGRITY - ELEGANT
SOLUTION FROM SPARK.
• Version 2.1: https://docs.databricks.com/spark/latest/spark-sql/dbio-commit.html
SO WHAT WILL IT LOOK LIKE?
COMPARISON.

FEATURE        | MINIO          | HDFS                 | MINIO VS HDFS
COST/TB/MONTH  | $              | $$                   | 10x
AVAILABILITY   | 99.99%         | 99.9%                | 10x
DURABILITY     | 99.999999999%  | 99.9999% (estimated) | 10x
WRITES         | DBIO           | YES                  | Comparable
ELASTICITY     | YES            | NO                   | Minio is elastic
MINIO.
• High-performance distributed object storage server
• Simple, efficient, lightweight, and no learning curve
DEMO TIME
1) MINIO INTEROPERABILITY WITH HADOOP – PUTTING AND GETTING DATA
2) MINIO INTEROPERABILITY WITH HIVE
3) MINIO WITH UNIFIED DATA ARCHITECTURE – PRESTO
4) MINIO WITH SPARK - FILES
5) MINIO WITH SPARK – OBJECTS
6) MINIO WITH SEARCH
SUMMARY: WHAT YOU GAIN WITH THIS ARCHITECTURE
THANK YOU!
Refer to:
https://blog.minio.io/modern-data-lake-with-minio-part-1-716a49499533
https://blog.minio.io/modern-data-lake-with-minio-part-2-f24fb5f82424
https://www.minio.io/
Apache Spark
Presto
QUESTIONS?
