Hadoop's Problem and How to Fix it

Hadoop’s biggest problem and how to fix it
Mark Chopping, COO, Kognitio
March 2017

How the mighty elephant has fallen
Recently there has been an increase in headlines critical of Hadoop:
“Hadoop has failed us, Tech Experts say”
https://www.datanami.com/2017/03/13/hadoop-failed-us-tech-experts-say/
“You’re doing Hadoop and Spark wrong, and they will probably fail”
https://www.theregister.co.uk/2017/02/21/hadoop_and_spark_risks_and_opportunities/
“Has Hadoop failed? That’s the wrong question”
http://www.podiumdata.com/blog/has-hadoop-failed-s-wrong-question

What’s behind the headlines?
• Hadoop is ill-suited for running interactive, user-facing applications.
• The once “hot new technology” is 10x slower than the data warehouses being used before.
• Hadoop is a terrible relational database.
• Hadoop couldn’t live up to its inflated expectations compared to mature commercial offerings.

What’s SQL got to do with it?
• High-performance SQL is one of the key reasons why Hadoop is currently under pressure.
• Most users need SQL to access data in Hadoop. Most of a business’s users will be using BI or visualization tools to access data; only a few will be data scientists or developers.
• Big data developers are under pressure from the business to maintain the same interactive performance on Hadoop data as the business had on its previous database systems.
• Many SQL on Hadoop solutions simply don't meet expectations because they are new, immature and perform badly.

So why are SQL on Hadoop products not performant?

Overhead of starting and stopping processes
• To run relatively simple queries quickly, you need to reduce latency.
• If starting and stopping containers to run tasks carries a lot of overhead, that is a big impediment to interactive usage, even if the actual processing is very efficient.
Product immaturity
• Many commercial databases were built on the shoulders of giants: Greenplum, Netezza, Redshift and others were derived from PostgreSQL, which gave them a great start in avoiding mistakes made in the past.
• Most of the newer SQL on Hadoop products were built from scratch, so their developers have to learn and solve problems that were long since addressed in commercial database products.
Evolution from batch processing
• If a product like Hive starts off based on MapReduce, its developers won’t begin by working on incremental improvements to latency, as these won’t have any visible effect.
• If Hive is then adopted for a lot of batch processing, there is even less incentive to work on reducing latency.
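As a rough illustration of the start/stop overhead point above, the sketch below computes what share of a query's latency is fixed overhead. The 5-second startup cost and the processing times are hypothetical figures chosen only to show the shape of the problem; they are not measurements of any product.

```python
def overhead_fraction(startup_s, processing_s):
    """Fraction of total query latency spent on fixed start/stop overhead."""
    total = startup_s + processing_s
    return startup_s / total


# Hypothetical figures, for illustration only (not measured values).
STARTUP_S = 5.0

for processing_s in (0.5, 5.0, 300.0):
    share = overhead_fraction(STARTUP_S, processing_s)
    print(f"processing {processing_s:>6.1f}s -> overhead is {share:.0%} of latency")
```

Under these assumed numbers, the fixed cost dominates a short interactive query but is negligible for a long batch job, which is why overhead that nobody notices in batch processing becomes the main obstacle to interactive use.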

The four things you should look for in a SQL on Hadoop solution

Performance
• Ensure that the queries you need to run complete in a time acceptable to your end users.
• Ensure that this remains true when you add concurrency. Don't run a POC that revolves around a single stream of queries, unless you really only have one user who only ever runs one session!
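A minimal sketch of a multi-stream POC harness, assuming you can wrap your SQL client in a function that takes a query string. The names `run_streams`, `summarize` and `fake_run_query` are hypothetical helpers written for this illustration, and `fake_run_query` simply sleeps in place of a real database call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def timed(run_query, sql):
    """Return wall-clock seconds taken by one call to run_query(sql)."""
    start = time.perf_counter()
    run_query(sql)
    return time.perf_counter() - start


def run_streams(run_query, queries, streams):
    """Run `streams` concurrent sessions, each executing every query in order.

    Returns a flat list of per-query latencies in seconds.
    """
    def one_stream(_):
        return [timed(run_query, sql) for sql in queries]

    with ThreadPoolExecutor(max_workers=streams) as pool:
        per_stream = list(pool.map(one_stream, range(streams)))
    return [latency for stream in per_stream for latency in stream]


def summarize(latencies):
    """Median and worst-case latency across all streams."""
    return {"median_s": median(latencies), "max_s": max(latencies)}


def fake_run_query(sql):
    # Stand-in for a real client call; replace with your own driver code.
    time.sleep(0.01)


if __name__ == "__main__":
    queries = ["SELECT 1", "SELECT 2"]  # placeholder query text
    for streams in (1, 4, 16):
        stats = summarize(run_streams(fake_run_query, queries, streams))
        print(streams, "streams:", stats)
```

Running the same query set at increasing stream counts shows whether latency stays acceptable as concurrency grows, which a single-stream test cannot reveal.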

Maturity
• Use something that’s mature in terms of query optimization and functionality. Many recent SQL on Hadoop products were developed from the ground up, so they are inherently less mature than offerings that have been around for longer.
  - Immaturity often shows up as problems when running the full set of TPC-DS queries.
• Ensure that ad-hoc queries still run and are performant, i.e. ensure that immaturity does not make functionality or performance brittle.

Interoperability
• Ensure the solution fits into the Hadoop environment, so it runs under YARN rather than requiring dedicated nodes.
• It should use HDFS for storage, rather than having its own separate storage.
• Ensure it works effectively with your end-user tools (e.g. Tableau, Qlik, MicroStrategy).

Stability
• You shouldn’t need to tweak configuration as you add more users or more data.
• Ensure there’s an established support offering from a vendor (especially for free-to-use or open source products) to assist with onboarding and any issues encountered.

Further reading
Try Kognitio on Hadoop
http://Kognitio.com/on-hadoop
“Hadoop has failed us, Tech Experts say”
https://www.datanami.com/2017/03/13/hadoop-failed-us-tech-experts-say/
“You’re doing Hadoop and Spark wrong, and they will probably fail”
https://www.theregister.co.uk/2017/02/21/hadoop_and_spark_risks_and_opportunities/
“Has Hadoop failed? That’s the wrong question”
http://www.podiumdata.com/blog/has-hadoop-failed-s-wrong-question
The Growing need for SQL for Hadoop
http://news.dataloco.com/the-growing-need-for-sql-for-hadoop-upside?via=tw
“What do you mean SQL can’t do big data?”
https://www.linkedin.com/pulse/what-do-you-mean-sql-cant-big-data-rick-van-der-lans
