Agile Data Lake? An Oxymoron?
Agenda
1. Part 1 - Data Lake Overview
2. Part 2 - Technology Deep Dive
Please interrupt with questions/comments so I know which slides to focus on.
Part 1 - Data Lake Overview
Data Lake - Definition
Data Lake - Definition - Martin Kleppmann
"Hadoop opened up the possibility of indiscriminately dumping data into HDFS, and only later figuring out how to process it further.
By contrast, MPP databases typically require careful up-front modeling of the data and query patterns before importing the data
into the database's proprietary storage format.
From a purist's point of view, it may seem that this careful modeling and import is desirable, because it means users of the
database have better-quality data to work with. However, in practice, it appears that simply making data available quickly - even if it
is in a quirky, difficult-to-use, raw format [/schema] - is often more valuable than trying to decide on the ideal data model up front.
The idea is similar to a data warehouse: simply bringing data from various parts of a large organization together in one place is
valuable, because it enables joins across datasets that were previously disparate. The careful schema design required by an MPP
database slows down that centralised data collection; collecting data in its raw form, and worrying about schema design later,
allows the data collection to be speeded up (a concept sometimes known as a "data lake" or "enterprise data hub").
... the interpretation of the data becomes the consumer's problem (the schema-on-read approach). ... There may not even be one
ideal data model, but rather different views onto the data that are suitable for different purposes. Simply dumping data in its raw
form allows for several such transformations. This approach has been dubbed the sushi principle: "raw data is better".
Pages 415-416, Martin Kleppmann - Designing Data-Intensive Applications
Data Lake - Definition - Martin Fowler
> The idea [Data Lake] is to have a single store for all of the raw data that anyone in an organization might
need to analyze.
...
> But there is a vital distinction between the data lake and the data warehouse. The data lake stores raw
data, in whatever form the data source provides. There is no assumptions about the schema of the data,
each data source can use whatever schema it likes. It's up to the consumers of that data to make sense of
that data for their own purposes.
> data put into the lake is immutable
> The data lake is “schemaless” [or Schema-on-Read]
> storage is oriented around the notion of a large schemaless structure … HDFS
https://martinfowler.com/bliki/DataLake.html ← MUST READ!
Prefer this: Individuals and interactions / engineers
Over this: Pretty GUI-based tools, deskilling, centralisation and processes
Trends (Sample = 11 Data Lakes)
| Good | Bad |
| --- | --- |
| Cost < £1m, business value in weeks/months | Cost many millions, years before business value |
| Schema on Read. Documentation as code - Internal Open Source | Schema on Write. Metastores, data dictionaries, Confluence |
| Cloud, PaaS, e.g. EMR, Dataproc. S3 for long term storage | On prem, Cloudera/Hortonworks. HDFS for long term storage |
| Scala / Java apps, Jenkins, CircleCI, etc with Bash for deployment and lightweight scheduling | Hive / Pig scripts, manual releases, heavyweight scheduling (Oozie, Airflow, Control-M, Luigi, etc) |
| High ratio (80%+) of committed developers/engineers who write code. Small in-house teams, highly skilled. | High ratio of involved people who do not commit code. Large low skilled offshore teams |
| Flat structure, cross-functional teams. Agile | Hierarchical, authoritarian. Waterfall |
Trends (Sample = 11 Data Lakes)
| Success | Failure |
| --- | --- |
| XP, KISS, YAGNI | BUFD, Tools, Processes, Governance, Complexity, Documentation |
| Cross functional individuals (can architect, code & do analysis) form a team that can deliver end to end business value right from source to consumption. | Co-dependent component teams, no one team can deliver an end to end solution. |
| Clear focus on 1 business problem, solve it, then solve 2nd business problem, solve it, then deduplicate (DRY) | No clear business focus, too many goals, lofty overly ambitious ideas, silver bullets, big hammers |
| Motivation - The WHY: Satisfaction from solving problems & automation | Motivation - The WHY: Deskilling & centralisation of power |
Silver Bullets & Big Hammers
Hive, Impala, Drill, Dremio, Presto, Delta, Athena, Kylo, Hudi, Ab Initio, etc
- Often built to demo/pitch to Architects (who don’t code) & non-technical people/non-engineers
- Consequently have well polished UIs but often lack quality under the hood
- Generally only handle happy cases
- Tend to assume all use cases are the same. Your use case will probably invalidate their assumptions
- The devil is in the details, and they obscure those details
- Generally make performance problems more complicated due to inherited and obscured complexity
- Often commercially motivated
- Few engineers/data scientists would recommend them, as they know what it's really like to build a Data Lake
and know that most of these tools won't work
- Often aim at deskilling, literally claiming that Engineers/Data Scientists are not necessary. They are
necessary, but now you have to pay a vendor/consultancy for those skills
  - at very high markup
  - with a lot of lost in translation issues and communication bottlenecks
  - with long delays in implementation
- Generally appeal to non-technical people who want centralisation and power, some tools literally
referring to users as “power users”
Note that there are exceptions: for example, Uber’s Hudi seems to have been built to solve real internal PB-scale data problems and only later open sourced. There may be other exceptions.
Data Lake Principles
1. Immutability & Reproducibility - Datasets should be immutable. Any
queries/jobs run on the Data Lake should be reproducible.
2. A Dataset corresponds to a directory and all the files in that directory, not to
individual files - Big Data is too big to fit into single files. Avoid appending to a
directory, as this is just like mutating it, thus violating 1.
3. An easy way to identify when new data has arrived - no scanning, no joining,
or complex event notification systems should be necessary. Simply partition
by landed date and let consumers keep track of their own offsets (like in Kafka).
4. Schema On Read - Parquet file metadata (the schema is stored in each file's
footer) plus directory structure form self-describing metadata (more next!).
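Principle 3 can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions: the `landed_date=YYYY-MM-DD` directory naming, the `LandedDateOffsets` object and its function names are hypothetical and not taken from the slides. The consumer stores the last landed date it processed as its offset and simply asks for everything after it, with no scanning of file contents and no event system.

```scala
import java.time.LocalDate

object LandedDateOffsets {
  // Hypothetical naming convention for partition directories: landed_date=YYYY-MM-DD
  private val Prefix = "landed_date="

  // Parse the landed date out of a partition directory name, if it follows the convention.
  def landedDate(partitionDir: String): Option[LocalDate] =
    if (partitionDir.startsWith(Prefix))
      scala.util.Try(LocalDate.parse(partitionDir.drop(Prefix.length))).toOption
    else None

  // Landed dates strictly after the consumer's stored offset, oldest first.
  // `offset = None` means this consumer has not processed anything yet.
  def unprocessed(partitionDirs: Seq[String], offset: Option[LocalDate]): Seq[LocalDate] =
    partitionDirs
      .flatMap(dir => landedDate(dir))
      .filter(date => offset.forall(o => date.isAfter(o)))
      .sortBy(_.toEpochDay)
}
```

Each consumer persists its own offset (the last date returned) wherever it likes, so producers never need to know who is reading, much like consumer offsets in Kafka.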
Metadata
- Schema-on-read - the Parquet file metadata (footer) has the schema
- Add lineage fields to data at each stage of a pipeline, especially later stages
- Internal Open Source via Monorepo
- Code is unambiguous
- Invest in high quality code control - Stop here!
- Analogy:
- An enterprise investing large amounts in meta-data services is like a restaurant investing large
amounts in menus
- In the best restaurants chefs write the specials of the day on a blackboard
- In the best enterprises innovation and code is created every day
- etc
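As a rough sketch of "add lineage fields to data at each stage", the snippet below carries lineage alongside each record and appends the stage name whenever a transformation runs. The `Lineage` object, `LineageInfo`, `Record` and `stage` are hypothetical names chosen for illustration; in a real pipeline these would more likely be extra columns written to Parquet.

```scala
object Lineage {
  // Hypothetical lineage fields: where the record came from and which stages touched it.
  final case class LineageInfo(sourcePath: String, stages: Vector[String])

  // A record is the payload plus its lineage.
  final case class Record[A](value: A, lineage: LineageInfo)

  // Apply a transformation and append the stage name to the record's lineage.
  def stage[A, B](name: String)(f: A => B)(in: Record[A]): Record[B] =
    Record(f(in.value), in.lineage.copy(stages = in.lineage.stages :+ name))
}
```

Because the lineage travels with the data and is produced by code in the monorepo, a consumer can read back which code paths produced a value without consulting a separate metadata service.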
Technology Choices - Analytics
| Requirement | Recommendation |
| --- | --- |
| SQL Interface to Parquet & Databases (via JDBC) | Apache Zeppelin |
| Spark, Scala, Java, Python, R, CLI, and more | Apache Zeppelin |
| Charts, graphs & visualisations (inc JS, D3 etc) | Apache Zeppelin |
| Free and open source | Apache Zeppelin |
| Integrated into infra (EMR, HDInsight) out-of-box - NoOps | Apache Zeppelin |
| Lightweight scheduling and job management | Apache Zeppelin |
| Basic Source Control & JSON Exports | Apache Zeppelin |
| In memory compute (via Spark) | Apache Zeppelin |
| Quickly implement dashboards & reports via WYSIWYG or JS | Apache Zeppelin |
| Active Directory & ACL integration | Apache Zeppelin |
Technology Choices - Software
| Requirement | Recommendation |
| --- | --- |
| Parquet Format | Spark |
| Production quality stable Spark APIs | Scala/Java |
| Streaming Architecture on Kafka | Scala/Java |
| Quick development & Dev Cycle | Statically typed languages |
| Production quality software / low bug density | Statically typed languages |
| Huge market of low skilled cheap resource where speed of delivery, software quality and data quality are not important (please read The Mythical Man-Month!) | Python |
https://insights.stackoverflow.com/survey/2019?utm_source=Iterable&utm_medium=email&utm_campaign=dev-survey-2019#top-paying-technologies
Conclusion
Yes you can build a Data Lake in an Agile way.
● Code first
● Upskill, not deskill
● Do not trust all vendor marketing literature & blogs
● Avoid most big tools, especially proprietary ones
Part 2 - Technology Deep Dive
Brief History of Spark
Version & Date: Notes
Up to 0.5, 2010 - 2012: Spark created as a better Hadoop MapReduce.
- Awesome functional typed Scala API (RDD)
- In memory caching
- Broadcast variables
- Mesos support
0.6, 14/10/2012:
- Java API (anticipating Java 8 & Scala 2.12 interop!)
0.7, 27/02/2013:
- Python API: PySpark
- Spark "Streaming" Alpha
0.8, 25/09/2013:
- Apache Incubator in June 2013
- September 2013, Databricks raises $13.9 million
- MLlib (nice idea, poor API) see https://github.com/samthebest/sceval/blob/master/README.md
0.9 - 1.6, 02/02/2014 - 04/01/2016: Hype years!
- February 2014, Spark becomes Top-Level Apache Project
- SparkSQL, and more Shiny (GraphX, Dataframe API, SparkR)
- Covariant RDDs requested https://issues.apache.org/jira/browse/SPARK-1296
Key: Good idea - technically motivated,
Not so good idea (probably commercially motivated?)
Brief History of Spark
Version & Date: Notes
2.0, 26/07/2016:
- Datasets API (nice idea, poor design):
  - Typed, semi declarative class based API
  - Improved serialisation
  - No way to inject custom serialisation
- StructuredStreaming API
  - Uses same API as Datasets (so what are .cache, .filter, .mapPartitions supposed
    to do? How do we branch? How to access a microbatch? How to control
    microbatch sizes? etc)
2.3, 28/02/2018:
- StructuredStreaming trying to play catch up with Kafka Streams, Akka Streams, etc
???, 2500?:
- Increase parallelism without shuffling https://issues.apache.org/jira/browse/SPARK-5997
- Num partitions no longer respects num files https://issues.apache.org/jira/browse/SPARK-24425
- Multiple SparkContexts https://issues.apache.org/jira/browse/SPARK-2243
- Closure cleaner bugs https://issues.apache.org/jira/browse/SPARK-26534
- Spores https://docs.scala-lang.org/sips/spores.html
- RDD covariance https://issues.apache.org/jira/browse/SPARK-1296
- Frameless to become native? https://github.com/typelevel/frameless
- Datasets to offer injectable custom serialisation based on this https://typelevel.org/frameless/Injection.html
Key: Good idea - technically motivated,
Not so good idea (probably commercially motivated?)
Spark APIs - RDD
- Motivated by true Open Source & Unix philosophy - solve a specific real
problem well, simply and flexibly
- Oldest most stable API, has very few bugs
- Boils down to two functions that neatly correspond to MapReduce paradigm:
- `mapPartitions`
- `combineByKey`
- Simple flexible API design
- Can customise serialisation (using `mapPartitions` and byte arrays)
- Can customise reading and writing (e.g. `binaryFiles`)
- Fairly functional, but does mutate state (e.g. `.cache()`)
- Advised API for experienced developers / data engineers, especially in the Big
Data space
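To make the "two functions" point concrete, here is a small model of the semantics in plain Scala collections. It is not Spark code and not how Spark implements these operations: `CombineByKeyModel` and the `Partitions` type are hypothetical stand-ins, and only the three-function shape of `combineByKey` (`createCombiner`, `mergeValue`, `mergeCombiners`) mirrors the RDD API. The map side works on whole partitions, and the reduce side combines per key inside each partition before merging across partitions.

```scala
object CombineByKeyModel {
  // A "partition" is modelled as a plain Seq; a dataset is a Seq of partitions.
  type Partitions[A] = Seq[Seq[A]]

  // Map side: apply a function to each whole partition (the role `mapPartitions` plays).
  def mapPartitions[A, B](data: Partitions[A])(f: Iterator[A] => Iterator[B]): Partitions[B] =
    data.map(partition => f(partition.iterator).toSeq)

  // Reduce side: combine values per key within each partition, then merge across partitions.
  def combineByKey[K, V, C](data: Partitions[(K, V)])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C
  ): Map[K, C] = {
    // Step 1: local combine, one Map per partition (no data crosses partitions here).
    val perPartition: Seq[Map[K, C]] = data.map { partition =>
      partition.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).fold(createCombiner(v))(c => mergeValue(c, v)))
      }
    }
    // Step 2: merge the per-partition combiners (the part a shuffle would feed).
    perPartition.foldLeft(Map.empty[K, C]) { (merged, local) =>
      local.foldLeft(merged) { case (acc, (k, c)) =>
        acc.updated(k, acc.get(k).fold(c)(existing => mergeCombiners(existing, c)))
      }
    }
  }
}
```

A word count then becomes one `mapPartitions` call that splits lines into `(word, 1)` pairs followed by one `combineByKey` call with addition for both merge functions, which is the MapReduce paradigm in miniature.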
Spark APIs - Dataset / Dataframe
- Motivated by increasing market size for vendors by targeting non-developers,
e.g. Analysts, Data Scientists and Architects
- Very buggy, e.g. (bugs I found in the last couple of months)
  - Reading: nulling entire rows, parsing timestamps incorrectly, not handling escaping properly, etc
  - Non-optional reference types are treated as nullable
  - Closure cleaner seems more buggy with Datasets (unnecessarily serialises)
- API Design inflexible
  - Cannot inject custom serialisation
  - No functional `combineByKey` API counterpart, have to instantiate an Aggregator
- Declarative API breaks MapReduce semantics, e.g.
  - A call to `groupBy` may not actually cause a groupby operation
- Advised API for those new to Big Data and generally trying to solve
“little/middle data” problems (i.e. extensive optimisations are not necessary),
and where data quality and application stability are less important (e.g. POCs).
Spark APIs - SparkSQL
- Buggy, unstable, unpredictable
- SQL optimiser is quite immature
- MapReduce is a functional paradigm while SQL is declarative; consequently these
two don’t get along very well
- All the usual problems with SQL: hard to test, no compiler, not Turing complete, etc
- Advised API for interactive analytical use only - never use for production
applications!
Frameless - Awesome!
- All of the benefits of Datasets without string literals
scala> fds.filter(fds('i) === 10).select(fds('x))
<console>:24: error: No column Symbol with shapeless.tag.Tagged[String("x")] of type A in Foo
fds.filter(fds('i) === 10).select(fds('x))
^
- Custom serialisation https://typelevel.org/frameless/Injection.html
- Cats integration, e.g. can join RDDs using `|+|`
- Advised API for both experienced Big Data Engineers and people new to Big
Data
Alternatives to Spark
- Kafka: see Kafka Streams & Akka Streams
- Flink

 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 

Agile data lake? An oxymoron?

  • 2. Agenda 1. Part 1 - Data Lake Overview 2. Part 2 - Technology Deep Dive Please interrupt with questions/comments so I know which slides to focus on.
  • 3. Part 1 - Data Lake Overview
  • 4. Data Lake - Definition
  • 5.
  • 6. Data Lake - Definition - Martin Kleppmann
      "Hadoop opened up the possibility of indiscriminately dumping data into HDFS, and only later figuring out how to process it further. By contrast, MPP databases typically require careful up-front modeling of the data and query patterns before importing the data into the database's proprietary storage format.
      From a purist's point of view, it may seem that this careful modeling and import is desirable, because it means users of the database have better-quality data to work with. However, in practice, it appears that simply making data available quickly - even if it is in a quirky, difficult-to-use, raw format [/schema] - is often more valuable than trying to decide on the ideal data model up front.
      The idea is similar to a data warehouse: simply bringing data from various parts of a large organization together in one place is valuable, because it enables joins across datasets that were previously disparate. The careful schema design required by an MPP database slows down that centralised data collection; collecting data in its raw form, and worrying about schema design later, allows the data collection to be speeded up (a concept sometimes known as a "data lake" or "enterprise data hub").
      ... the interpretation of the data becomes the consumer's problem (the schema-on-read approach). ... There may not even be one ideal data model, but rather different views onto the data that are suitable for different purposes. Simply dumping data in its raw form allows for several such transformations. This approach has been dubbed the sushi principle: "raw data is better".
      Pages 415-416, Martin Kleppmann - Designing Data-Intensive Applications
  • 7. Data Lake - Definition - Martin Fowler
      "The idea [Data Lake] is to have a single store for all of the raw data that anyone in an organization might need to analyze. ..."
      "But there is a vital distinction between the data lake and the data warehouse. The data lake stores raw data, in whatever form the data source provides. There are no assumptions about the schema of the data, each data source can use whatever schema it likes. It's up to the consumers of that data to make sense of that data for their own purposes."
      "data put into the lake is immutable"
      "The data lake is "schemaless"" [or Schema-on-Read]
      "storage is oriented around the notion of a large schemaless structure ... HDFS"
      https://martinfowler.com/bliki/DataLake.html ← MUST READ!
  • 8.
  • 10. Over this Pretty GUI based Tools, Deskilling, Centralisation and Processes
  • 11. Trends (Sample = 11 Data Lakes)
      Good | Bad
      Cost < £1m, business value in weeks/months | Cost many millions, years before business value
      Schema on Read; documentation as code - Internal Open Source | Schema on Write; metastores, data dictionaries, Confluence
      Cloud, PaaS, e.g. EMR, Dataproc; S3 for long term storage | On prem, Cloudera/Hortonworks; HDFS for long term storage
      Scala / Java apps, Jenkins, CircleCI, etc with Bash for deployment and lightweight scheduling | Hive / Pig scripts, manual releases, heavyweight scheduling (Oozie, Airflow, Control-M, Luigi, etc)
      High ratio (80%+) of committed developers/engineers that write code. Small in-house teams, highly skilled. | High ratio of involved people who do not commit code. Large, low-skilled offshore teams
      Flat structure, cross-functional teams; Agile | Hierarchical, authoritarian; Waterfall
  • 12. Trends (Sample = 11 Data Lakes)
      Success | Failure
      XP, KISS, YAGNI | BUFD, Tools, Processes, Governance, Complexity, Documentation
      Cross-functional individuals (can architect, code & do analysis) form a team that can deliver end-to-end business value right from source to consumption. | Co-dependent component teams, no one team can deliver an end-to-end solution.
      Clear focus on 1 business problem, solve it, then solve 2nd business problem, solve it, then deduplicate (DRY) | No clear business focus, too many goals, lofty overly ambitious ideas, silver bullets, big hammers
      Motivation - The WHY: Satisfaction from solving problems & automation | Motivation - The WHY: Deskilling & centralisation of power
  • 13. Hive, Impala, Drill, Dremio, Presto, Delta, Athena, Kylo, Hudi, Ab Initio, etc
  • 14. Silver Bullets & Big Hammers
      - Often built to demo/pitch to Architects (who don't code) and other non-engineers
      - Consequently have well-polished UIs but often lack quality under the hood
      - Generally only handle happy cases
      - Tend to assume all use cases are the same. Your use case will probably invalidate their assumptions
      - The devil is in the details, and they obscure those details
      - Generally make performance problems more complicated due to inherited and obscured complexity
      - Often commercially motivated
      - Few engineers/data scientists would recommend them, as they know what it's really like to build a Data Lake and know that most of these tools won't work
      - Often aim at deskilling, literally claiming that Engineers/Data Scientists are not necessary. They are necessary, but now you have to pay a vendor/consultancy for those skills:
          - at a very high markup
          - with a lot of lost-in-translation issues and communication bottlenecks
          - with long delays in implementation
      - Generally appeal to non-technical people who want centralisation and power, some tools literally referring to users as "power users"
      Note that there are exceptions, for example Uber's Hudi seems to have been built to solve real internal PB-scale data problems and was only later open sourced. There may be other exceptions.
  • 15. Data Lake Principles
      1. Immutability & Reproducibility - Datasets should be immutable. Any queries/jobs run on the Data Lake should be reproducible.
      2. A Dataset corresponds to a directory and all the files in that directory, not to individual files - Big Data is too big to fit into single files. Avoid appending to a directory, as this is just like mutating it and so violates 1.
      3. An easy way to identify when new data has arrived - no scanning, joining, or complex event notification systems should be necessary. Simply partition by landed date and let consumers keep track of their own offsets (like in Kafka).
      4. Schema On Read - Parquet file footers (which hold the schema) plus directory structure form self-describing metadata (more next!).
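Principle 3 can be illustrated with a minimal sketch in plain Scala (no Spark). It assumes a hypothetical directory naming convention of `landed_date=YYYY-MM-DD` and a consumer that stores the last landed date it processed as its offset; neither the names nor the convention come from the slides.

```scala
import java.time.LocalDate

object LandedDateOffsets {
  // Hypothetical convention: one immutable directory per landed date,
  // e.g. "landed_date=2019-03-01".
  private val Prefix = "landed_date="

  // Parse the landed date out of a partition directory name; anything else is ignored.
  def landedDate(dirName: String): Option[LocalDate] =
    if (dirName.startsWith(Prefix))
      scala.util.Try(LocalDate.parse(dirName.stripPrefix(Prefix))).toOption
    else None

  // Partitions that landed strictly after the consumer's offset, oldest first.
  // `lastProcessed = None` means the consumer has not read anything yet.
  def newPartitions(dirNames: Seq[String], lastProcessed: Option[LocalDate]): Seq[String] =
    dirNames
      .flatMap(d => landedDate(d).map(date => (date, d)))
      .filter { case (date, _) => lastProcessed.forall(offset => date.isAfter(offset)) }
      .sortBy { case (date, _) => date.toEpochDay }
      .map { case (_, d) => d }
}
```

Each consumer only needs a directory listing and its own stored offset, so no shared notification service is involved; after processing it would advance its offset to the newest landed date it has read.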
  • 16. Metadata
      - Schema-on-read - the Parquet file footer holds the schema
      - Add lineage fields to data at each stage of a pipeline, especially later stages
      - Internal Open Source via Monorepo
          - Code is unambiguous
          - Invest in high quality code control
      - Stop here!
      - Analogy:
          - An enterprise investing large amounts in metadata services is like a restaurant investing large amounts in menus
          - In the best restaurants chefs write the specials of the day on a blackboard
          - In the best enterprises innovation and code are created every day
      - etc
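One way the "add lineage fields at each stage" idea might look in code is sketched below in plain Scala. The slides do not say which fields to record, so `Lineage`, `Tracked`, `stage` and the choice of fields are all hypothetical.

```scala
object LineageSketch {
  // Hypothetical lineage fields; the slides do not specify which ones to keep.
  final case class Lineage(sourcePath: String, landedDate: String, stages: List[String])

  // A record travelling through the pipeline, carrying its lineage next to the payload.
  final case class Tracked[A](value: A, lineage: Lineage)

  // Apply one pipeline stage and append its name and code version to the lineage.
  def stage[A, B](name: String, codeVersion: String)(f: A => B)(in: Tracked[A]): Tracked[B] =
    Tracked(f(in.value), in.lineage.copy(stages = in.lineage.stages :+ s"$name@$codeVersion"))
}
```

Because the lineage travels with the data, a consumer of a late-stage dataset can see where a record came from and which code produced it without consulting a separate metadata service.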
  • 17. Technology Choices - Analytics
      Requirement | Recommendation
      SQL Interface to Parquet & Databases (via JDBC) | Apache Zeppelin
      Spark, Scala, Java, Python, R, CLI, and more | Apache Zeppelin
      Charts, graphs & visualisations (inc JS, D3 etc) | Apache Zeppelin
      Free and open source | Apache Zeppelin
      Integrated into infra (EMR, HDInsight) out-of-box - NoOps | Apache Zeppelin
      Lightweight scheduling and job management | Apache Zeppelin
      Basic Source Control & JSON Exports | Apache Zeppelin
      In memory compute (via Spark) | Apache Zeppelin
      Quickly implement dashboards & reports via WYSIWYG or JS | Apache Zeppelin
      Active Directory & ACL integration | Apache Zeppelin
  • 18. Technology Choices - Software
      Requirement | Recommendation
      Parquet Format | Spark
      Production quality stable Spark APIs | Scala/Java
      Streaming Architecture on Kafka | Scala/Java
      Quick development & Dev Cycle | Statically typed languages
      Production quality software / low bug density | Statically typed languages
      Huge market of low skilled cheap resource where speed of delivery, software quality and data quality are not important (please read The Mythical Man-Month!) | Python
      https://insights.stackoverflow.com/survey/2019?utm_source=Iterable&utm_medium=email&utm_campaign=dev-survey-2019#top-paying-technologies
  • 19.
  • 20.
  • 21. Conclusion
      Yes, you can build a Data Lake in an Agile way.
      ● Code first
      ● Upskill, not deskill
      ● Do not trust all vendor marketing literature & blogs
      ● Avoid most big tools, especially proprietary ones
  • 22. Part 2 - Technology Deep Dive
  • 23. Brief History of Spark
      Version & Date | Notes
      Up to 0.5, 2010 - 2012 | Spark created as a better Hadoop MapReduce.
          - Awesome functional typed Scala API (RDD)
          - In memory caching
          - Broadcast variables
          - Mesos support
      0.6, 14/10/2012 |
          - Java API (anticipating Java 8 & Scala 2.12 interop!)
      0.7, 27/02/2013 |
          - Python API: PySpark
          - Spark "Streaming" Alpha
      0.8, 25/09/2013 |
          - Apache Incubator in June 2013
          - September 2013, Databricks raises $13.9 million
          - MLlib (nice idea, poor API) see https://github.com/samthebest/sceval/blob/master/README.md
      0.9 - 1.6, 02/02/2014 - 04/01/2016 | Hype years!
          - February 2014, Spark becomes Top-Level Apache Project
          - SparkSQL, and more Shiny (GraphX, Dataframe API, SparkR)
          - Covariant RDDs requested https://issues.apache.org/jira/browse/SPARK-1296
      Key: Good idea - technically motivated, Not so good idea (probably commercially motivated?)
  • 24. Brief History of Spark
      Version & Date | Notes
      2.0, 26/07/2016 | Datasets API (nice idea, poor design):
          - Typed, semi declarative class based API
          - Improved serialisation
          - No way to inject custom serialisation
        StructuredStreaming API
          - Uses same API as Datasets (so what are .cache, .filter, .mapPartitions supposed to do? How do we branch? How to access a microbatch? How to control microbatch sizes? etc)
      2.3, 28/02/2018 | StructuredStreaming trying to play catch up with Kafka Streams, Akka Streams, etc
      ???, 2500? |
          - Increase parallelism without shuffling https://issues.apache.org/jira/browse/SPARK-5997
          - Num partitions no longer respects num files https://issues.apache.org/jira/browse/SPARK-24425
          - Multiple SparkContexts https://issues.apache.org/jira/browse/SPARK-2243
          - Closure cleaner bugs https://issues.apache.org/jira/browse/SPARK-26534
          - Spores https://docs.scala-lang.org/sips/spores.html
          - RDD covariance https://issues.apache.org/jira/browse/SPARK-1296
          - Frameless to become native? https://github.com/typelevel/frameless
          - Datasets to offer injectable custom serialisation based on this https://typelevel.org/frameless/Injection.html
      Key: Good idea - technically motivated, Not so good idea (probably commercially motivated?)
  • 25. Spark APIs - RDD
      - Motivated by true Open Source & Unix philosophy - solve a specific real problem well, simply and flexibly
      - Oldest, most stable API, has very few bugs
      - Boils down to two functions that neatly correspond to the MapReduce paradigm:
          - `mapPartitions`
          - `combineByKey`
      - Simple flexible API design
          - Can customise serialisation (using `mapPartitions` and byte arrays)
          - Can customise reading and writing (e.g. `binaryFiles`)
      - Fairly functional, but does mutate state (e.g. `.cache()`)
      - Advised API for experienced developers / data engineers, especially in the Big Data space
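To show why `combineByKey` maps so neatly onto MapReduce, here is a minimal sketch of its semantics over plain Scala collections. This is not Spark code: a "partition" is modelled as a `Seq`, and the object and method names are hypothetical. The three function arguments mirror the `createCombiner`, `mergeValue` and `mergeCombiners` parameters of Spark's `combineByKey`.

```scala
object CombineByKeySketch {
  // Illustration only: partitions are plain Seqs, nothing is distributed.
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C
  ): Map[K, C] = {
    // "Map" side: combine the values for each key within a single partition.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).fold(createCombiner(v))(c => mergeValue(c, v)))
      }
    }
    // "Reduce" side: merge the per-partition combiners for each key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).fold(c)(existing => mergeCombiners(existing, c)))
      }
    }
  }

  // Example use: per-key average via a (sum, count) combiner.
  def averages(partitions: Seq[Seq[(String, Double)]]): Map[String, Double] =
    combineByKey[String, Double, (Double, Int)](
      partitions,
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ).map { case (k, (sum, n)) => k -> sum / n }
}
```

The per-partition fold plays the role of the map-side combine and the final fold plays the role of the reduce after the shuffle, which is the correspondence the slide points at.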
  • 26. Spark APIs - Dataset / Dataframe
      - Motivated by increasing market size for vendors by targeting non-developers, e.g. Analysts, Data Scientists and Architects
      - Very buggy, e.g. (bugs I found in the last couple of months)
          - Reading: nulling entire rows, parsing timestamps incorrectly, not handling escaping properly, etc
          - Non-optional reference types are treated as nullable
          - Closure cleaner seems more buggy with Datasets (unnecessarily serialises)
      - API design is inflexible
          - Cannot inject custom serialisation
          - No functional `combineByKey` API counterpart, have to instantiate an Aggregator
      - Declarative API breaks MapReduce semantics, e.g.
          - A call to `groupBy` may not actually cause a groupby operation
      - Advised API for those new to Big Data and generally trying to solve "little/middle data" problems (i.e. extensive optimisations are not necessary), and where data quality and application stability are less important (e.g. POCs).
  • 27. Spark APIs - SparkSQL
      - Buggy, unstable, unpredictable
      - SQL optimiser is quite immature
      - MapReduce is a functional paradigm while SQL is declarative, consequently these two don't get along very well
      - All the usual problems with SQL: hard to test, no compiler, not Turing complete, etc
      - Advised API for interactive analytical use only - never use for production applications!
  • 28. Frameless
      - Awesome!
      - All of the benefits of Datasets without string literals:
          scala> fds.filter(fds('i) === 10).select(fds('x))
          <console>:24: error: No column Symbol with shapeless.tag.Tagged[String("x")] of type A in Foo
                 fds.filter(fds('i) === 10).select(fds('x))
                 ^
      - Custom serialisation https://typelevel.org/frameless/Injection.html
      - Cats integration, e.g. can join RDDs using `|+|`
      - Advised API for both experienced Big Data Engineers and people new to Big Data
  • 29. Alternatives to Spark
      - Kafka: see Kafka Streams & Akka Streams
      - Flink