Big Data and MicroStrategy:
Building a Bridge for the Elephant

Paul Groom, Chief Innovation Officer
Jan 2013
Let’s start at…

The End.
Panacea
You…built the EDW
You…built the BICC
and yes you built…
lots of cool reports and dashboards
Epilogue
A comfortable status quo
How are you really judged?
• Fast?
• Consistent?
• All users?
Rrrrrriiiiiiinnnnnngggggg!

Back to the real world
Disruption
Disruptor: New Data
Disruptor: Social Media & Sentiment
Disruptor: Data?
Disruptor: More Connected Users
Disruptor: Data Discovery Tools

Choices for engaging quickly with data

Business users’ heads distracted from core BI!
BI Wild West
Where it matters
Lots of variety of DW and EDW
The Reality of the DW

analytical workload
EDW says no, or not now!
…and CFO says no big upgrades
Pragmatism

…ok, so you enable plenty of caching,
limit drill-anywhere,
and add Intelligent Cubes
And then came…
Distraction
or
Boon

http://oris-rake.deviantart.com/
Scalable, resilient, bit bucket
Experimenting

© 20th Century Fox
The Hadoop stack

[Diagram: the Hadoop component stack: Pig and Hive on top; ZooKeeper / Ambari alongside; HBase, MapReduce, Oozie and HCatalog in the middle; HDFS at the base]
Hadoop Performance Reality
• Hadoop is batch oriented
• HDFS access is fast but crude
• MapReduce is powerful but has overheads
     – ~30 second base response time
     – Too much latency in stack and processing model
     – Trade-off between optimization and latency
• MapReduce is complex
     – Typically multiple Java routines

https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
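The overheads above come from the processing model itself. A toy pure-Python sketch of the three MapReduce phases (not the Hadoop API) shows where the latency sits: the shuffle step, which in a real cluster means network transfer and on-disk merge sorting between map and reduce:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (word, 1) pairs; in Hadoop each mapper writes this
    # intermediate output to local disk before anything else happens.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group intermediate pairs by key; in Hadoop this
    # involves network transfer and an on-disk merge sort, a large
    # share of the fixed per-job latency.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big bridge", "big elephant"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # 3
```

Even this trivial word count passes through all three phases; on a real cluster each phase adds job-scheduling and I/O cost regardless of data size, which is where the ~30 second floor comes from.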
SQL to the Rescue
• So MapReduce is complicated
     – use Hive (SQL) as the easy way out

[Diagram: the same Hadoop stack again: Pig, Hive, ZooKeeper / Ambari, HBase, MapReduce, Oozie, HCatalog, HDFS]
Hive
• Simplifies access
     “Hive is great, but Hadoop’s execution engine
     makes even the smallest queries take minutes!”
• Only basic SQL support
• Concurrency needs careful system admin
• It’s not a silver bullet for interactive BI usage
Conclusion

Hadoop is just too slow
for interactive BI!
     “while hadoop shines as a processing
     platform, it is painfully slow as a query tool”

…loss of train-of-thought
Hive is based on Hadoop which is a batch processing system. Accordingly,
this system does not and cannot promise low latencies on queries. The
paradigm here is strictly of submitting jobs and being notified when the jobs
are completed as opposed to real time queries. As a result it should not be
compared with systems like Oracle where analysis is done on a
significantly smaller amount of data but the analysis proceeds much more
iteratively with the response times between iterations being less than a few
minutes. For Hive queries response times for even the smallest jobs
can be of the order of 5-10 minutes and for larger jobs this may even
run into hours.

I remain skeptical on the practical performance of the Hive query approach
and have yet to talk to any beta customers. A more practical approach is
loading some of the Hadoop data into the in-memory cube with the new
Hadoop connector.
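That "load into the in-memory cube" approach can be sketched in a few lines of plain Python. The rows here are a hypothetical stand-in for data pulled through a Hadoop connector; the point is that aggregation cost is paid once at load time, after which queries are memory lookups, decoupled from Hadoop's batch latency:

```python
from collections import defaultdict

# Hypothetical rows pulled from Hadoop (in practice these would come
# from an HDFS read or a connector, not a literal list).
rows = [
    ("2013-01", "EMEA", 120.0),
    ("2013-01", "NA",    80.0),
    ("2013-02", "EMEA",  95.0),
]

def build_cube(rows):
    # Aggregate once at load time; subsequent queries are dictionary
    # lookups, no longer tied to Hadoop's job-submission model.
    cube = defaultdict(float)
    for month, region, amount in rows:
        cube[(month, region)] += amount
    return cube

cube = build_cube(rows)
print(cube[("2013-01", "EMEA")])  # 120.0
```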
Why can’t Hadoop be in-memory?
Why can’t I have giant icubes?
Remember…

Lots of these
Hadoop is inherently disk oriented

Not so many of these
Typically a low ratio of CPU to disk
Larger cubes

 Issues: Time to Populate, Proliferation
Alternative - In-memory Processing

Analytics requires CPU,
cores do the work!
RAM keeps the data close,
scale with the data
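The "cores do the work, RAM keeps the data close" idea can be illustrated with a minimal sketch: data already resident in memory is partitioned across workers (standing in for CPU cores), and each worker scans only its own slice, with no disk I/O on the query path:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(values, workers=4):
    # Partition the in-memory data so each worker (a stand-in for a
    # CPU core) scans its own slice; merging the partial sums is cheap.
    chunk = max(1, len(values) // workers)
    slices = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, slices))

data = list(range(1, 1001))  # data already resident in RAM
print(parallel_sum(data))  # 500500
```

In a real in-memory analytical engine the same pattern applies across machines as well as cores, which is why such systems scale by adding RAM and CPU together with the data.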
Goals: Minimise Disruption, Cut Latency
• Don’t change the existing BI and analytics
• Support more creative and dynamic BI
• Don’t introduce yet more slow disk
     – Help the DW investment
• No complex ETL, just pull data as required
• Pull data simply and intelligently from Hadoop
• Simplify – fewer cubes and caches
• Improve sharing of data
• Increase concurrency and throughput
     – It’s all about queries per hour!
• Minimal DBA requirement
Kognitio Hadoop Connectors
HDFS Connector
• Connector defines access to the HDFS file system
• External table accesses row-based data in HDFS
• Dynamic access, or “pin” data into memory
• Selected HDFS file(s) loaded into memory
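The external-table idea above, dynamic access versus pinning, can be sketched in plain Python. This is a toy model, not the Kognitio API: `io.StringIO` stands in for an HDFS file stream, and `pin()` mimics loading the selected file into memory so repeat scans skip the file read:

```python
import io

class ExternalTable:
    """Toy stand-in for an external table over a row-based HDFS file.
    `opener` returns a file-like object (here StringIO; in reality an
    HDFS stream)."""
    def __init__(self, opener):
        self.opener = opener
        self.pinned = None

    def scan(self):
        # Dynamic access: re-read the underlying file on every query,
        # unless the data has been pinned into memory.
        if self.pinned is not None:
            return self.pinned
        with self.opener() as f:
            return [line.rstrip("\n").split(",") for line in f]

    def pin(self):
        # "Pin" the selected file's rows into memory for repeated access.
        self.pinned = self.scan()

table = ExternalTable(lambda: io.StringIO("a,1\nb,2\n"))
first = table.scan()   # reads the file
table.pin()
second = table.scan()  # now served from memory
```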




Filter Agent Connector
• Connector uploads an agent to the Hadoop nodes
• Query passes selections and relevant predicates to the agent
• Data filtering and projection takes place locally on each Hadoop node
• Only data of interest is loaded into memory via parallel load streams
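The filter-agent pattern is essentially predicate and projection pushdown. A minimal sketch, with hypothetical per-node row sets standing in for HDFS data, shows the shape of it: the predicate and column list travel to the data, and only matching, projected rows travel back:

```python
def filter_agent(rows, predicate, columns):
    # Runs "on the Hadoop node": apply the query's predicate and keep
    # only the requested columns, so just the data of interest travels.
    return [tuple(row[c] for c in columns) for row in rows if predicate(row)]

# Hypothetical per-node row sets (dicts standing in for HDFS records).
node_a = [{"region": "EMEA", "sales": 10, "notes": "x"},
          {"region": "NA",   "sales": 7,  "notes": "y"}]
node_b = [{"region": "EMEA", "sales": 5,  "notes": "z"}]

wanted = lambda r: r["region"] == "EMEA"
# Parallel load streams, merged by the in-memory engine:
loaded = (filter_agent(node_a, wanted, ["region", "sales"]) +
          filter_agent(node_b, wanted, ["region", "sales"]))
print(loaded)  # [('EMEA', 10), ('EMEA', 5)]
```

The design choice is the point: filtering where the data lives keeps the network and the in-memory layer from being flooded with rows and columns the query never asked for.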
BI – Central Governance

Centrally defined data models
Persist data in natural store
Fetch when needed, agile
Available to all tools
Analytical power
Engineering for Success

Thomas Herbrich
connect

www.kognitio.com
NA: +1 855 KOGNITIO
EMEA: +44 1344 300 770

linkedin.com/companies/kognitio    twitter.com/kognitio
tinyurl.com/kognitio               youtube.com/kognitio
