Big Data and MicroStrategy:
Building a Bridge for the Elephant

Paul Groom, Chief Innovation Officer
Jan 2013
Let’s start at…

The End.
Panacea
You…built the EDW
You…built the BICC
and yes you built…
lots of cool reports and dashboards
Epilogue
A comfortable status quo
How are you really judged?
• Fast?
• Consistent?
• All users?
Rrrrrriiiiiiinnnnnngggggg!

Back to the real world
Disruption
Disruptor: New Data
Disruptor: Social Media & Sentiment
Disruptor: Data?
Disruptor: More Connected Users
Disruptor: Data Discovery Tools

Choices for engaging quickly with data

Business users’ heads distracted from core BI!
BI Wild West
Where it matters
Lots of variety of DW and EDW
The Reality of the DW

analytical workload
EDW says no, or not now!
…and CFO says no big upgrades
Pragmatism

…ok, so you enable plenty of caching,
limit drill-anywhere,
and add Intelligent Cubes
And then came…
Distraction
or
Boon

http://oris-rake.deviantart.com/
Scalable, resilient, bit bucket
Experimenting

© 20th Century Fox
The Hadoop stack

[Diagram: the Hadoop component stack: Pig and Hive on top; ZooKeeper / Ambari alongside; HBase, MapReduce, Oozie and HCatalog in the middle; HDFS at the base]
Hadoop Performance Reality
• Hadoop is batch oriented
• HDFS access is fast but crude
• MapReduce is powerful but has overheads
     – ~30 second base response time
     – Too much latency in stack and processing model
     – Trade-off between optimization and latency
• MapReduce is complex
     – Typically multiple Java routines

https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
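The overheads above come from the processing model itself. A toy pure-Python sketch of the three MapReduce phases (not the Hadoop API) shows where the latency sits: the shuffle step, which in a real cluster means network transfer and on-disk merge sorting between map and reduce:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (word, 1) pairs; in Hadoop each mapper writes this
    # intermediate output to local disk before anything else happens.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group intermediate pairs by key; in Hadoop this
    # involves network transfer and an on-disk merge sort, a large
    # share of the fixed per-job latency.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big bridge", "big elephant"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # 3
```

Even this trivial word count passes through all three phases; on a real cluster each phase adds job-scheduling and I/O cost regardless of data size, which is where the ~30 second floor comes from.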
SQL to the Rescue
• So MapReduce is complicated
     – use Hive (SQL) as the easy way out

[Diagram: the same Hadoop stack again: Pig, Hive, ZooKeeper / Ambari, HBase, MapReduce, Oozie, HCatalog, HDFS]
Hive
• Simplifies access
     “Hive is great, but Hadoop’s execution engine
     makes even the smallest queries take minutes!”
• Only basic SQL support
• Concurrency needs careful system admin
• It’s not a silver bullet for interactive BI usage
Conclusion

Hadoop is just too slow
for interactive BI!
     “while hadoop shines as a processing
     platform, it is painfully slow as a query tool”

…loss of train-of-thought
Hive is based on Hadoop which is a batch processing system. Accordingly,
this system does not and cannot promise low latencies on queries. The
paradigm here is strictly of submitting jobs and being notified when the jobs
are completed as opposed to real time queries. As a result it should not be
compared with systems like Oracle where analysis is done on a
significantly smaller amount of data but the analysis proceeds much more
iteratively with the response times between iterations being less than a few
minutes. For Hive queries response times for even the smallest jobs
can be of the order of 5-10 minutes and for larger jobs this may even
run into hours.

I remain skeptical on the practical performance of the Hive query approach
and have yet to talk to any beta customers. A more practical approach is
loading some of the Hadoop data into the in-memory cube with the new
Hadoop connector.
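That "load into the in-memory cube" approach can be sketched in a few lines of plain Python. The rows here are a hypothetical stand-in for data pulled through a Hadoop connector; the point is that aggregation cost is paid once at load time, after which queries are memory lookups, decoupled from Hadoop's batch latency:

```python
from collections import defaultdict

# Hypothetical rows pulled from Hadoop (in practice these would come
# from an HDFS read or a connector, not a literal list).
rows = [
    ("2013-01", "EMEA", 120.0),
    ("2013-01", "NA",    80.0),
    ("2013-02", "EMEA",  95.0),
]

def build_cube(rows):
    # Aggregate once at load time; subsequent queries are dictionary
    # lookups, no longer tied to Hadoop's job-submission model.
    cube = defaultdict(float)
    for month, region, amount in rows:
        cube[(month, region)] += amount
    return cube

cube = build_cube(rows)
print(cube[("2013-01", "EMEA")])  # 120.0
```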
Why can’t Hadoop be in-memory?
Why can’t I have giant icubes?
Remember…

Lots of these
Hadoop is inherently disk oriented

Not so many of these
Typically a low ratio of CPU to disk
Larger cubes

 Issues: Time to Populate, Proliferation
Alternative - In-memory Processing

Analytics requires CPU,
cores do the work!
RAM keeps the data close,
scale with the data
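The "cores do the work, RAM keeps the data close" idea can be illustrated with a minimal sketch: data already resident in memory is partitioned across workers (standing in for CPU cores), and each worker scans only its own slice, with no disk I/O on the query path:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(values, workers=4):
    # Partition the in-memory data so each worker (a stand-in for a
    # CPU core) scans its own slice; merging the partial sums is cheap.
    chunk = max(1, len(values) // workers)
    slices = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, slices))

data = list(range(1, 1001))  # data already resident in RAM
print(parallel_sum(data))  # 500500
```

In a real in-memory analytical engine the same pattern applies across machines as well as cores, which is why such systems scale by adding RAM and CPU together with the data.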
Goals: Minimise Disruption, Cut Latency
• Don’t change the existing BI and analytics
• Support more creative and dynamic BI
• Don’t introduce yet more slow disk
     – Help the DW investment
• No complex ETL, just pull data as required
• Pull data simply and intelligently from Hadoop
• Simplify – fewer cubes and caches
• Improve sharing of data
• Increase concurrency and throughput
     – It’s all about queries per hour!
• Minimal DBA requirement
Kognitio Hadoop Connectors
HDFS Connector
• Connector defines access to the HDFS file system
• External table accesses row-based data in HDFS
• Dynamic access, or “pin” data into memory
• Selected HDFS file(s) loaded into memory
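The external-table idea above, dynamic access versus pinning, can be sketched in plain Python. This is a toy model, not the Kognitio API: `io.StringIO` stands in for an HDFS file stream, and `pin()` mimics loading the selected file into memory so repeat scans skip the file read:

```python
import io

class ExternalTable:
    """Toy stand-in for an external table over a row-based HDFS file.
    `opener` returns a file-like object (here StringIO; in reality an
    HDFS stream)."""
    def __init__(self, opener):
        self.opener = opener
        self.pinned = None

    def scan(self):
        # Dynamic access: re-read the underlying file on every query,
        # unless the data has been pinned into memory.
        if self.pinned is not None:
            return self.pinned
        with self.opener() as f:
            return [line.rstrip("\n").split(",") for line in f]

    def pin(self):
        # "Pin" the selected file's rows into memory for repeated access.
        self.pinned = self.scan()

table = ExternalTable(lambda: io.StringIO("a,1\nb,2\n"))
first = table.scan()   # reads the file
table.pin()
second = table.scan()  # now served from memory
```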




Filter Agent Connector
• Connector uploads an agent to the Hadoop nodes
• Query passes selections and relevant predicates to the agent
• Data filtering and projection takes place locally on each Hadoop node
• Only data of interest is loaded into memory via parallel load streams
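The filter-agent pattern is essentially predicate and projection pushdown. A minimal sketch, with hypothetical per-node row sets standing in for HDFS data, shows the shape of it: the predicate and column list travel to the data, and only matching, projected rows travel back:

```python
def filter_agent(rows, predicate, columns):
    # Runs "on the Hadoop node": apply the query's predicate and keep
    # only the requested columns, so just the data of interest travels.
    return [tuple(row[c] for c in columns) for row in rows if predicate(row)]

# Hypothetical per-node row sets (dicts standing in for HDFS records).
node_a = [{"region": "EMEA", "sales": 10, "notes": "x"},
          {"region": "NA",   "sales": 7,  "notes": "y"}]
node_b = [{"region": "EMEA", "sales": 5,  "notes": "z"}]

wanted = lambda r: r["region"] == "EMEA"
# Parallel load streams, merged by the in-memory engine:
loaded = (filter_agent(node_a, wanted, ["region", "sales"]) +
          filter_agent(node_b, wanted, ["region", "sales"]))
print(loaded)  # [('EMEA', 10), ('EMEA', 5)]
```

The design choice is the point: filtering where the data lives keeps the network and the in-memory layer from being flooded with rows and columns the query never asked for.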
BI – Central Governance

Centrally defined data models
Persist data in natural store
Fetch when needed, agile
Available to all tools
Analytical power
Engineering for Success

Thomas Herbrich
connect

www.kognitio.com
NA: +1 855 KOGNITIO
EMEA: +44 1344 300 770

linkedin.com/companies/kognitio    twitter.com/kognitio
tinyurl.com/kognitio               youtube.com/kognitio
