Hadoop’s biggest problem and how to fix it
Mark Chopping, COO, Kognitio
March 2017
How the mighty elephant has fallen
Recently there has been an increase in headlines critical of Hadoop:
“Hadoop has failed us, Tech Experts say”
https://www.datanami.com/2017/03/13/hadoop-failed-us-tech-experts-say/
“You’re doing Hadoop and Spark wrong, and they will probably fail”
https://www.theregister.co.uk/2017/02/21/hadoop_and_spark_risks_and_opportunities/
“Has Hadoop failed? That’s the wrong question”
http://www.podiumdata.com/blog/has-hadoop-failed-s-wrong-question
What’s behind the headlines?
• Hadoop is ill-suited for running interactive, user-facing applications.
• The once “hot new technology” is 10x slower than the data warehouses being used before.
• Hadoop is a terrible relational database
• Hadoop couldn’t live up to its inflated expectations compared to mature commercial offerings
What’s SQL got to do with it?
• The demand for high-performance SQL is one of the key reasons why Hadoop is currently under pressure
• Most users need SQL to access data in Hadoop (most of a business’s users will be using BI or visualization tools to access data; only a few will be Data Scientists or Developers)
• Big Data Developers are under pressure from the business to maintain the same interactive performance on Hadoop data as the business had on its previous database systems
• Many SQL on Hadoop solutions simply don't meet expectations because they are new, immature and perform badly
So why are SQL on Hadoop products not performant?
Overhead of starting and stopping processes
• To run relatively simple queries quickly, you need to reduce latency
• A lot of overhead for starting and stopping containers to run tasks is a big impediment to interactive usage, even if the actual processing is very efficient
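The effect of fixed overhead on short queries can be illustrated with some simple arithmetic. This is only a sketch: the function name `overhead_share` and the 5-second startup cost are assumptions made up for illustration, not measurements of any product.

```python
def overhead_share(startup_s: float, processing_s: float) -> float:
    """Fraction of total response time spent on fixed startup/teardown."""
    total = startup_s + processing_s
    return startup_s / total


# Hypothetical figure, chosen only for illustration: the fixed cost of
# starting and stopping containers for one query.
STARTUP_S = 5.0

if __name__ == "__main__":
    # Compare a short interactive query, a medium query and a long batch job.
    for processing_s in (0.5, 5.0, 500.0):
        share = overhead_share(STARTUP_S, processing_s)
        print(f"processing {processing_s:6.1f}s: "
              f"{share:.0%} of response time is overhead")
```

With the same fixed cost, the overhead dominates the short query and is negligible for the long batch job, which is why it matters for interactive use in particular.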
Product immaturity
• Many commercial databases were built on the shoulders of giants: Greenplum, Netezza, Redshift and others were derived from PostgreSQL, which gave them a great start in avoiding mistakes made in the past
• Most of the newer SQL on Hadoop products were built from scratch, so their developers have to learn about and solve problems that were long since addressed in commercial database products
Evolution from batch processing
• If a product like Hive starts off based on MapReduce, its developers won’t begin by working on incremental improvements to latency, as those won’t have any noticeable effect
• If Hive is then adopted for a lot of batch processing, there is less incentive to work on reducing latency
The four things you should look for in a SQL on Hadoop solution
Performance
• Ensure that the queries you need to run can complete in a time acceptable to your end users.
• Ensure that when you add concurrency, the above remains true (so don't run a POC which revolves around a single stream of queries, unless you really only have 1 user and they only ever run 1 session!)
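A POC along these lines could time the same query set at several concurrency levels. The sketch below is one possible harness, not a definitive benchmark: `measure`, `timed` and `fake_run_query` are hypothetical names, and `fake_run_query` merely sleeps where a real test would submit SQL through whatever client or driver your product provides.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median
from typing import Callable, List


def timed(run_query: Callable[[str], None], sql: str) -> float:
    """Return the wall-clock seconds taken by one query."""
    start = time.perf_counter()
    run_query(sql)
    return time.perf_counter() - start


def measure(run_query: Callable[[str], None],
            queries: List[str], streams: int) -> List[float]:
    """Run the full query list in `streams` parallel streams; return all latencies."""
    def one_stream(_: int) -> List[float]:
        return [timed(run_query, q) for q in queries]

    with ThreadPoolExecutor(max_workers=streams) as pool:
        per_stream = list(pool.map(one_stream, range(streams)))
    return [latency for stream in per_stream for latency in stream]


def fake_run_query(sql: str) -> None:
    # Assumption: stand-in for a real client call; replace with your own.
    time.sleep(0.01)


if __name__ == "__main__":
    queries = ["SELECT 1", "SELECT 2"]  # placeholder query set
    for streams in (1, 4, 16):
        latencies = measure(fake_run_query, queries, streams)
        print(f"{streams:2d} streams: median {median(latencies):.3f}s, "
              f"max {max(latencies):.3f}s")
```

Comparing the median and maximum latency across stream counts shows whether response times stay acceptable as users are added, which a single-stream test cannot reveal.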
Maturity
• Use something that’s mature in terms of query optimization and functionality. Many recent SQL on Hadoop products were developed from the ground up, so are inherently less mature than offerings that have been around for longer.
  - Immaturity can show up as problems when running the full set of TPC-DS queries
• Ensure that ad-hoc queries still run, and are performant, i.e. ensure that immaturity does not mean functionality/performance is brittle
Interoperability
• Ensure the solution fits into the Hadoop environment, so runs under YARN rather than requiring dedicated nodes
• It should use HDFS for storage, rather than having its own separate storage
• Ensure it works effectively with your end user tools (e.g. Tableau, Qlik, MicroStrategy etc.)
Stability
• You shouldn’t need to tweak configuration as you add more users or more data
• Ensure there’s an established support offering from a vendor (especially for free-to-use or open source products) to assist with onboarding and any issues encountered
Further reading
Try Kognitio on Hadoop
http://Kognitio.com/on-hadoop
“Hadoop has failed us, Tech Experts say”
https://www.datanami.com/2017/03/13/hadoop-failed-us-tech-experts-say/
“You’re doing Hadoop and Spark wrong, and they will probably fail”
https://www.theregister.co.uk/2017/02/21/hadoop_and_spark_risks_and_opportunities/
“Has Hadoop failed? That’s the wrong question”
http://www.podiumdata.com/blog/has-hadoop-failed-s-wrong-question
“The Growing need for SQL for Hadoop”
http://news.dataloco.com/the-growing-need-for-sql-for-hadoop-upside?via=tw
“What do you mean SQL can’t do big data?”
https://www.linkedin.com/pulse/what-do-you-mean-sql-cant-big-data-rick-van-der-lans
