6. Data Lake - Definition - Martin Kleppmann
"Hadoop opened up the possibility of indiscriminately dumping data into HDFS, and only later figuring out how to process it further.
By contrast, MPP databases typically require careful up-front modeling of the data and query patterns before importing the data
into the database's proprietary storage format.
From a purist's point of view, it may seem that this careful modeling and import is desirable, because it means users of the
database have better-quality data to work with. However, in practice, it appears that simply making data available quickly - even if it
is in a quirky, difficult-to-use, raw format [or schema] - is often more valuable than trying to decide on the ideal data model up front.
The idea is similar to a data warehouse: simply bringing data from various parts of a large organization together in one place is
valuable, because it enables joins across datasets that were previously disparate. The careful schema design required by an MPP
database slows down that centralised data collection; collecting data in its raw form, and worrying about schema design later,
allows the data collection to be speeded up (a concept sometimes known as a "data lake" or "enterprise data hub").
... the interpretation of the data becomes the consumer's problem (the schema-on-read approach). ... There may not even be one
ideal data model, but rather different views onto the data that are suitable for different purposes. Simply dumping data in its raw
form allows for several such transformations. This approach has been dubbed the sushi principle: "raw data is better".
Pages 415-416, Martin Kleppmann, Designing Data-Intensive Applications
7. Data Lake - Definition - Martin Fowler
> The idea [Data Lake] is to have a single store for all of the raw data that anyone in an organization might
need to analyze.
...
> But there is a vital distinction between the data lake and the data warehouse. The data lake stores raw
data, in whatever form the data source provides. There are no assumptions about the schema of the data;
each data source can use whatever schema it likes. It's up to the consumers of that data to make sense of
that data for their own purposes.
> data put into the lake is immutable
> The data lake is “schemaless” [or Schema-on-Read]
> storage is oriented around the notion of a large schemaless structure … HDFS
https://martinfowler.com/bliki/DataLake.html ← MUST READ!
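To make schema-on-read concrete, a minimal Spark sketch (paths and field names are illustrative assumptions): the producer lands raw data untouched, and each consumer imposes its own schema at read time.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

object SchemaOnRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    // The producer dumped raw JSON lines into the lake, with no up-front modelling
    val raw = "s3://lake/landing/clickstream/landed_date=2019-10-01"

    // Consumer A: quick exploration, lets Spark infer a schema at read time
    val inferred = spark.read.json(raw)

    // Consumer B: imposes its own, stricter view of the very same raw data
    val strict = StructType(Seq(
      StructField("user_id", StringType, nullable = false),
      StructField("event_time", TimestampType, nullable = true)
    ))
    val typed = spark.read.schema(strict).json(raw)
  }
}
```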
11. Trends (Sample = 11 Data Lakes)
| Good | Bad |
|---|---|
| Cost < £1m, business value in weeks/months | Cost many millions, years before business value |
| Schema on Read | Schema on Write |
| Documentation as code - Internal Open Source | Metastores, data dictionaries, Confluence |
| Cloud, PaaS, e.g. EMR, Dataproc | On-prem, Cloudera/Hortonworks |
| S3 for long-term storage | HDFS for long-term storage |
| Scala/Java apps; Jenkins, CircleCI, etc. with Bash for deployment and lightweight scheduling | Hive/Pig scripts, manual releases, heavyweight scheduling (Oozie, Airflow, Control-M, Luigi, etc.) |
| High ratio (80%+) of committed developers/engineers who write code | High ratio of involved people who do not commit code |
| Small in-house teams, highly skilled | Large, low-skilled offshore teams |
| Flat structure, cross-functional teams | Hierarchical, authoritarian |
| Agile | Waterfall |
12. Trends (Sample = 11 Data Lakes)
| Success | Failure |
|---|---|
| XP, KISS, YAGNI | BDUF, tools, processes, governance, complexity, documentation |
| Cross-functional individuals (who can architect, code & do analysis) form a team that can deliver end-to-end business value right from source to consumption | Co-dependent component teams; no one team can deliver an end-to-end solution |
| Clear focus on one business problem - solve it, then solve a second business problem, solve it, then deduplicate (DRY) | No clear business focus, too many goals, lofty overly ambitious ideas, silver bullets, big hammers |
| Motivation - the WHY: satisfaction from solving problems & automation | Motivation - the WHY: deskilling & centralisation of power |
14. Silver Bullets & Big Hammers
- Often built to demo/pitch to Architects (who don't code) & non-technical/non-engineers
- Consequently they have well-polished UIs, but often lack quality under the hood
- Generally only handle happy cases
- Tend to assume all use cases are the same. Your use case will probably invalidate their assumptions
- The devil is in the details, and they obscure those details
- Generally make performance problems more complicated due to inherited and obscured complexity
- Often commercially motivated
- Few engineers/data scientists would recommend them, as they know what it's really like to build a Data Lake
and know that most of these tools won't work
- Often aim at deskilling, literally claiming that Engineers/Data Scientists are not necessary. They are
necessary, but now you have to pay a vendor/consultancy for those skills
- at very high markup
- with a lot of lost in translation issues and communication bottlenecks
- long delays in implementation
- Generally appeal to non-technical people that want centralisation and power, some tools literally
referring to users as “power users”
Note that there are exceptions; for example, Uber's Hudi seems to have been built to solve real internal petabyte-scale data problems, then later Open Sourced. There may be other exceptions.
15. Data Lake Principles
1. Immutability & Reproducibility - datasets should be immutable, and any
queries/jobs run on the Data Lake should be reproducible
2. A Dataset corresponds to a directory and all the files in that directory, not to
individual files - Big Data is too big to fit into single files. Avoid appending to a
directory, as this is just like mutating it, thus violating 1.
3. An easy way to identify when new data has arrived - no scanning, no joining,
and no complex event notification systems should be necessary. Simply partition
by landed date and have consumers keep track of their own offsets (like in
Kafka); see the sketch after this list
4. Schema on Read - Parquet footers plus directory structure form self-describing
metadata (more next!)
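A minimal sketch of principle 3, assuming a local directory standing in for an S3/HDFS dataset partitioned by landed date (the names and layout are illustrative):

```scala
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

object NewPartitionFinder {
  // Hypothetical layout standing in for s3://lake/events/:
  //   /data/lake/events/landed_date=2019-10-01/part-00000.parquet
  val datasetRoot = Paths.get("/data/lake/events")

  // The consumer persists only its own offset: the last landed date it processed.
  // ISO dates sort lexicographically, so a plain string comparison finds new data -
  // no scanning of file contents, no joins, no event notification system.
  def newPartitions(lastProcessed: String): Seq[String] =
    Files.list(datasetRoot).iterator().asScala
      .map(_.getFileName.toString) // e.g. "landed_date=2019-10-02"
      .collect { case s if s.startsWith("landed_date=") => s.stripPrefix("landed_date=") }
      .filter(_ > lastProcessed)
      .toSeq
      .sorted
}
```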
16. Metadata
- Schema-on-read - the Parquet footer contains the schema
- Add lineage fields to the data at each stage of a pipeline, especially later stages
(see the sketch after this list)
- Internal Open Source via a monorepo
- Code is unambiguous
- Invest in high-quality source control - and stop there!
- Analogy:
- An enterprise investing large amounts in metadata services is like a restaurant investing large
amounts in menus
- In the best restaurants, chefs write the specials of the day on a blackboard
- In the best enterprises, innovation and code are created every day
- etc.
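As a sketch of the lineage point above - adding lineage columns at one pipeline stage with Spark; the column names, paths, and version string are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_timestamp, lit}

object AddLineage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("add-lineage").getOrCreate()

    // Read one immutable landed-date partition (path is illustrative)
    val input = spark.read.parquet("s3://lake/events/landed_date=2019-10-01")

    // Append lineage columns so downstream consumers can trace every row
    val withLineage = input
      .withColumn("lineage_stage", lit("sessionise"))           // which stage produced the row
      .withColumn("lineage_job_version", lit("1.4.2"))          // code version, for reproducibility
      .withColumn("lineage_processed_at", current_timestamp()) // when it was processed

    // Write to a new dataset directory rather than mutating the input (principle 1)
    withLineage.write.parquet("s3://lake/sessions/landed_date=2019-10-01")
  }
}
```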
17. Technology Choices - Analytics
| Requirement | Recommendation |
|---|---|
| SQL interface to Parquet & databases (via JDBC) | Apache Zeppelin |
| Spark, Scala, Java, Python, R, CLI, and more | Apache Zeppelin |
| Charts, graphs & visualisations (inc. JS, D3, etc.) | Apache Zeppelin |
| Free and open source | Apache Zeppelin |
| Integrated into infra (EMR, HDInsight) out of the box - NoOps | Apache Zeppelin |
| Lightweight scheduling and job management | Apache Zeppelin |
| Basic source control & JSON exports | Apache Zeppelin |
| In-memory compute (via Spark) | Apache Zeppelin |
| Quickly implement dashboards & reports via WYSIWYG or JS | Apache Zeppelin |
| Active Directory & ACL integration | Apache Zeppelin |
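As a sketch of how Zeppelin covers the first row, assuming the stock %spark interpreter (the paths, view and column names are illustrative); `z` is Zeppelin's built-in context:

```scala
// Zeppelin paragraph 1 (%spark): expose a Parquet dataset to SQL via a temp view
val events = spark.read.parquet("s3://lake/events/landed_date=2019-10-01")
events.createOrReplaceTempView("events")

// Zeppelin paragraph 2 (%spark): query it; z.show renders the result as an
// interactive table/chart widget in the notebook
z.show(spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type"))
```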
18. Technology Choices - Software
| Requirement | Recommendation |
|---|---|
| Parquet format | Spark |
| Production-quality, stable Spark APIs | Scala/Java |
| Streaming architecture on Kafka | Scala/Java |
| Quick development & dev cycle | Statically typed languages |
| Production-quality software / low bug density | Statically typed languages |
| Huge market of low-skilled, cheap resource, where speed of delivery, software quality and data quality are not important (please read The Mythical Man-Month!) | Python |

https://insights.stackoverflow.com/survey/2019#top-paying-technologies
21. Conclusion
Yes, you can build a Data Lake in an Agile way.
● Code first
● Upskill, not deskill
● Do not trust all vendor marketing literature & blogs
● Avoid most big tools, especially proprietary ones
23. Brief History of Spark
Up to 0.5, 2010 - 2012: Spark created as a better Hadoop MapReduce.
- Awesome functional typed Scala API (RDD)
- In-memory caching
- Broadcast variables
- Mesos support

0.6, 14/10/2012:
- Java API (anticipating Java 8 & Scala 2.12 interop!)

0.7, 27/02/2013:
- Python API: PySpark
- Spark "Streaming" alpha

0.8, 25/09/2013:
- Apache Incubator in June 2013
- September 2013, Databricks raises $13.9 million
- MLlib (nice idea, poor API); see https://github.com/samthebest/sceval/blob/master/README.md

0.9 - 1.6, 02/02/2014 - 04/01/2016: the hype years!
- February 2014, Spark becomes a Top-Level Apache Project
- SparkSQL, and more shiny things (GraphX, DataFrame API, SparkR)
- Covariant RDDs requested: https://issues.apache.org/jira/browse/SPARK-1296

Key: Good idea = technically motivated; Not so good idea = probably commercially motivated?
24. Brief History of Spark
2.0, 26/07/2016:
- Datasets API (nice idea, poor design):
  - Typed, semi-declarative, class-based API
  - Improved serialisation
  - No way to inject custom serialisation
- StructuredStreaming API:
  - Uses the same API as Datasets (so what are .cache, .filter and .mapPartitions supposed
    to do? How do we branch? How do we access a microbatch? How do we control
    microbatch sizes? etc.)

2.3, 28/02/2018:
- StructuredStreaming trying to play catch-up with Kafka Streams, Akka Streams, etc.

???, 2500?
- Increase parallelism without shuffling: https://issues.apache.org/jira/browse/SPARK-5997
- Num partitions no longer respects num files: https://issues.apache.org/jira/browse/SPARK-24425
- Multiple SparkContexts: https://issues.apache.org/jira/browse/SPARK-2243
- Closure cleaner bugs: https://issues.apache.org/jira/browse/SPARK-26534
- Spores: https://docs.scala-lang.org/sips/spores.html
- RDD covariance: https://issues.apache.org/jira/browse/SPARK-1296
- Frameless to become native? https://github.com/typelevel/frameless
- Datasets to offer injectable custom serialisation, based on https://typelevel.org/frameless/Injection.html

Key: Good idea = technically motivated; Not so good idea = probably commercially motivated?
25. Spark APIs - RDD
- Motivated by true Open Source & Unix philosophy - solve a specific real
problem well, simply and flexibly
- Oldest and most stable API; has very few bugs
- Boils down to two functions that neatly correspond to the MapReduce paradigm
(see the sketch after this list):
- `mapPartitions`
- `combineByKey`
- Simple, flexible API design
- Can customise serialisation (using `mapPartitions` and byte arrays)
- Can customise reading and writing (e.g. `binaryFiles`)
- Fairly functional, but does mutate state (e.g. `.cache()`)
- Advised API for experienced developers / data engineers, especially in the Big
Data space
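A minimal sketch of those two primitives in action - a word count where `mapPartitions` plays the map phase and `combineByKey` the reduce phase (the input path and local master are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-wc").setMaster("local[*]"))
    val lines = sc.textFile("/tmp/input.txt")

    // "map" phase: one pass per partition, emitting (word, 1) pairs
    val pairs = lines.mapPartitions(_.flatMap(_.split("\\s+")).map(w => (w, 1)))

    // "reduce" phase: merge counts within, then across, partitions
    val counts = pairs.combineByKey[Int](
      (v: Int) => v,                 // createCombiner
      (acc: Int, v: Int) => acc + v, // mergeValue (within a partition)
      (a: Int, b: Int) => a + b      // mergeCombiners (across partitions)
    )

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```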
26. Spark APIs - Dataset / Dataframe
- Motivated by increasing market size for vendors by targeting non-developers,
e.g. Analysts, Data Scientists and Architects
- Very buggy; e.g. bugs I found in the last couple of months:
- Reading: nulling entire rows, parsing timestamps incorrectly, not handling escaping properly, etc
- Non-optional reference types are treated as nullable
- Closure cleaner seems more buggy with Datasets (unnecessarily serialises)
- API Design inflexible
- cannot inject custom serialisation
- No functional `combineByKey` API counterpart; you have to instantiate an Aggregator
(see the sketch after this list)
- Declarative API breaks MapReduce semantics, e.g.
- A call to `groupBy` may not actually cause a group-by operation
- Advised API for those new to Big Data who are generally trying to solve
"little/middle data" problems (i.e. where extensive optimisations are not necessary),
and where data quality and application stability are less important (e.g. POCs).
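For contrast with the RDD sketch above, the same word count via the Dataset API - note the Aggregator class (with explicit encoders) standing in for `combineByKey`'s three lambdas; paths are illustrative:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// combineByKey's three lambdas become a full class with explicit encoders
object CountAgg extends Aggregator[String, Long, Long] {
  def zero: Long = 0L                                 // createCombiner's starting point
  def reduce(acc: Long, word: String): Long = acc + 1 // mergeValue
  def merge(a: Long, b: Long): Long = a + b           // mergeCombiners
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

object DatasetWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ds-wc").master("local[*]").getOrCreate()
    import spark.implicits._

    val words = spark.read.textFile("/tmp/input.txt").flatMap(_.split("\\s+"))
    val counts = words.groupByKey(identity).agg(CountAgg.toColumn.name("count"))
    counts.show(10)
  }
}
```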
27. Spark APIs - SparkSQL
- Buggy, unstable, unpredictable
- SQL optimiser is quite immature
- MapReduce is a functional paradigm while SQL is declarative, consequently these
two don’t get along very well
- All the usual problems with SQL: hard to test, no compiler, not Turing complete, etc.
- Advised API for interactive analytical use only - never use for production
applications!
28. Frameless - Awesome!
- All of the benefits of Datasets without string literals
scala> fds.filter(fds('i) === 10).select(fds('x))
<console>:24: error: No column Symbol with shapeless.tag.Tagged[String("x")] of type A in Foo
       fds.filter(fds('i) === 10).select(fds('x))
                                             ^
(referencing the non-existent column 'x in Foo fails at compile time rather than at runtime)
- Custom serialisation https://typelevel.org/frameless/Injection.html
- Cats integration, e.g. can join RDDs using `|+|`
- Advised API for both experienced Big Data Engineers and people new to Big
Data
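A minimal, self-contained sketch of the snippet above, assuming Frameless's `TypedDataset` and `Injection` APIs (the `Foo` fields and implicit wiring are illustrative and slightly version-dependent):

```scala
import frameless.{Injection, TypedDataset}
import org.apache.spark.sql.{SQLContext, SparkSession}

case class Foo(i: Int, j: String) // note: no field named x

object FramelessDemo {
  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession =
      SparkSession.builder.appName("frameless-demo").master("local[*]").getOrCreate()
    implicit val sqlContext: SQLContext = spark.sqlContext // required by TypedDataset.create

    // Custom serialisation via Injection (https://typelevel.org/frameless/Injection.html):
    // e.g. store a java.util.Date column as a Long
    implicit val dateToLong: Injection[java.util.Date, Long] =
      Injection(_.getTime, new java.util.Date(_))

    val fds: TypedDataset[Foo] = TypedDataset.create(Seq(Foo(10, "a"), Foo(11, "b")))

    // Compiles, because 'i exists on Foo...
    val filtered = fds.filter(fds('i) === 10)
    // ...whereas fds.select(fds('x)) fails to compile, as shown on the slide

    filtered.dataset.show() // .dataset exposes the underlying vanilla Dataset[Foo]
  }
}
```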