Outside the Box: Alternate Query Models & the Future of Big Data


The Briefing Room with Dr. Robin Bloor and Infobright
Live Webcast Dec. 17
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?AT=pb&SP=EC&rID=7950017&rKey=9b6b134099af5b46

How big a role will Big Data play in the future of analytics? There’s no question that all flavors of Big Data are here to stay, especially the rising waters of machine-generated data. Cramming all the details you want into a giant data warehouse will no longer be tenable, which means other, more federated solutions must arise. That’s where alternate query models will save the day.

Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains how the widening landscape of Big Data will continue to transform the manner in which analytics are done. He’ll be briefed by Don DeLoach of Infobright, who will tout his company’s Big Data strategy, which focuses on greatly expediting the process of doing analysis on large sets of federated data.

Visit InsideAnalysis.com for more information


Transcript of "Outside the Box: Alternate Query Models and the Future of Big Data"

  1. Grab some coffee and enjoy the pre-show banter before the top of the hour!
  2. Outside the Box: Alternate Query Models & the Future of Big Data. The Briefing Room
  3. Welcome. Host: Eric Kavanagh, eric.kavanagh@bloorgroup.com. Twitter Tag: #briefr. The Briefing Room
  4. Mission
     • Reveal the essential characteristics of enterprise software, good and bad
     • Provide a forum for detailed analysis of today's innovative technologies
     • Give vendors a chance to explain their product to savvy analysts
     • Allow audience members to pose serious questions... and get answers!
  5. Topics. This Month: INNOVATORS. January: ANALYTICS. February: BIG DATA. 2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
  6. Data Discovery & Visualization. INNOVATORS
  7. Analyst: Robin Bloor. Robin Bloor is Chief Analyst at The Bloor Group. robin.bloor@bloorgroup.com
  8. Infobright
     • Infobright's columnar database is used for applications and data marts that analyze large volumes of machine-generated data
     • It leverages patented compression and optimization techniques, and a "knowledge grid," to achieve real-time analytics
     • Infobright offers a commercial version of its software, as well as a freely-available, open source product
  9. Guests: Don DeLoach and Jeff Kibler. Don DeLoach is CEO and President of Infobright. Jeff Kibler is Senior Technical Architect for Infobright.
  10. Turning "Huh?" into "Aha!": Alternate Query Models and Big Data Analytics
  11. About Infobright
     • 400+ direct and OEM customers across North America, EMEA and Asia
     • 1,000 installations
     • 8 of Top 10 Global Telecom Carriers use Infobright via OEM/ISVs
     • Logistics, Manufacturing, Business Intelligence; Online & Mobile Advertising/Web Analytics, eCommerce, Social Networks; Government, Utilities, Research; Financial Services; Telecom, Security
  12. Core Competencies
     • Columnar Database: designed for fast analytics; deep data compression
     • Intelligence, not Hardware: Knowledge Grid; Iterative Engine
     • Administrative Simplicity: no manual tuning; minimal ongoing administration
  13. Machine-Generated Data Is Everywhere
     • Weblogs
     • Computer, network events
     • Call detail records
     • Financial trade data
     • Sensors, RFID
     • Online game data
     Businesses need to extract insight in near-real time from rapidly growing data volume:
     • Segment and target website visitors
     • Troubleshoot networks
     • Identify security threats and fraud
     • Optimize online/mobile ads
  14. Internet of Things is a Multiplier for EVERYTHING
  15. Emerging Data Analytics Stack: Days of One-Size-Fits-All Are Gone
     "Yesterday's BI-ETL-EDW stack is wrong-sided for tomorrow's needs, and quickly becoming irrelevant." Gigamon
     • Data management
     • Hadoop transforming this area
     • Transparent analytic stack
     • Operational, investigative, predictive
     • Machine-generated, text
     • User consumption
     • Real-time, interactive visualization & query creation
     • Data Center / Data Warehouse
     • Infrastructure strategies, options proliferating
  16. Infobright: Columnar Architecture
     • Column Orientation
     • Knowledge Grid: statistics and metadata "describing" the super-compressed data
     • Data Packs: data stored in manageably sized, highly compressed data packs
     • Data compressed using algorithms tailored to data type
     Smarter architecture
     • Load data and go
     • No indices or partitions to build and maintain
     • Knowledge Grid automatically updated as data packs are created or updated
     • Super-compact data footprint can leverage off-the-shelf hardware
  17. The Knowledge Grid
     • Knowledge Grid: applies to the whole table
     • Knowledge Nodes: built for each Data Pack
     • Information about the data
     • [Diagram: Column A split into Data Packs DP1 to DP6; Column B; …]
     • Global knowledge; string and character data; numeric data; built during LOAD; distributions
     • Dynamic knowledge; built per query, e.g. for aggregates, joins
     • Knowledge Nodes answer the query directly, or
     • Identify only required Data Packs, minimizing decompression, and
     • Predict required data in advance based on workload
  18. Optimizer / Granular Engine
     1. Query received
     2. Engine iterates on Knowledge Grid
     3. Each pass eliminates Data Packs
     4. If any Data Packs are needed to resolve query, only those are decompressed
     [Diagram labels: Query; Knowledge Grid; Results; 1%; "Q: How are my sales doing this year?"; Compressed Data]
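The four steps on the Optimizer slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Infobright's implementation: the `KnowledgeNode` fields (min, max, sum, count), the function names and the tiny pack size are all hypothetical, chosen to show how per-pack metadata can rule packs in or out before any data is read.

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 4  # tiny for illustration; the slides cite 65,536 values per data pack


@dataclass
class KnowledgeNode:
    """Hypothetical per-pack statistics kept outside the compressed data."""
    lo: int
    hi: int
    total: int
    count: int


def build_packs(values: List[int], pack_size: int = PACK_SIZE):
    """Split a column into packs and record one knowledge node per pack."""
    packs = [values[i:i + pack_size] for i in range(0, len(values), pack_size)]
    nodes = [KnowledgeNode(min(p), max(p), sum(p), len(p)) for p in packs]
    return packs, nodes


def sum_where_between(packs, nodes, low: int, high: int) -> Tuple[int, int]:
    """SUM(col) WHERE low <= col <= high, reading only packs the metadata cannot resolve."""
    total, opened = 0, 0
    for pack, node in zip(packs, nodes):
        if node.hi < low or node.lo > high:
            continue                    # no row can match: pack is eliminated
        if low <= node.lo and node.hi <= high:
            total += node.total         # every row matches: answered from metadata
            continue
        opened += 1                     # partial overlap: this pack must be read
        total += sum(v for v in pack if low <= v <= high)
    return total, opened


column = [1, 2, 3, 4, 10, 11, 12, 13, 5, 20, 6, 30, 100, 101, 102, 103]
packs, nodes = build_packs(column)
result, opened = sum_where_between(packs, nodes, 5, 25)
print(f"sum={result}, packs opened={opened} of {len(packs)}")
```

In this toy column, two packs are eliminated outright, one is answered from its stored sum, and only one has to be read.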
  19. Infobright Architecture: Data Packs and Compression
     Data Packs
     • Each data pack contains 65,536 data values
     • Compression is applied to each individual data pack
     • The compression algorithm varies depending on data type and distribution
     Compression
     • Results vary depending on the distribution of data among data packs
     • A typical overall compression ratio seen in the field is 10:1
     • Some customers have seen results of 40:1 and higher
     • For example, 1TB of raw data compressed 10 to 1 would only require 100GB of disk capacity
     [Diagram labels: 64K data packs; Patent-Pending Compression Algorithms]
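The sizing arithmetic on this slide is simple enough to check directly. The helper names below are hypothetical, and 1TB is taken as 1,000GB to match the slide's 100GB figure.

```python
PACK_VALUES = 65_536  # values per data pack, as stated on the slide


def packs_needed(row_count: int) -> int:
    """Number of data packs one column needs for row_count values (ceiling division)."""
    return -(-row_count // PACK_VALUES)


def compressed_size_gb(raw_gb: float, ratio: float) -> float:
    """Disk footprint after compression at ratio:1."""
    return raw_gb / ratio


print(packs_needed(1_000_000))        # packs per column for one million rows
print(compressed_size_gb(1000, 10))   # 1TB at the typical 10:1 ratio
print(compressed_size_gb(1000, 40))   # 1TB at the 40:1 ratio some customers report
```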
  20. What Your Data Looks Like Now
     • Original data: 10TB
     • Compressed data: 50 GB (avg compression ratio of 20:1)
     • Plus Knowledge Grid: < .5 GB (< 1% of compressed data)
  21. Alternate Query Models: When Good Enough Works
     • The "principle of exactness" is the default for most data analytics and access systems today
     • Using "approximate queries," good enough answers can be found using fewer resources
     • Works best when given the ability to alternate between approximation and exactness in an easy way
     • Creates an interactivity that accelerates time to answers and reduces the computing resources required
  22. Tools for Investigative Analysis
     Today, Infobright provides:
     • Standard Queries: Knowledge Grid is used to aid performance; only required data packs are opened; retrieves exact results
     • Rough Queries: Only the Knowledge Grid is used to derive an answer quickly, typically for analytics like SUM, AVG, MAX
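To make the idea of a rough query concrete, here is a minimal Python sketch of answering from metadata alone. It assumes each data pack exposes a hypothetical `(min, max, row_count)` triple and returns a guaranteed range for a count instead of one exact number; the slides do not describe how Infobright's rough queries report their answers, so this is purely illustrative.

```python
from typing import List, Tuple

Node = Tuple[int, int, int]  # hypothetical per-pack metadata: (min, max, row_count)


def rough_count_greater(nodes: List[Node], threshold: int) -> Tuple[int, int]:
    """Bounds on COUNT(*) WHERE col > threshold, without opening any data pack."""
    lower = upper = 0
    for lo, hi, count in nodes:
        if lo > threshold:       # every row in the pack matches
            lower += count
            upper += count
        elif hi > threshold:     # some rows match; at least the row holding the max does
            lower += 1
            upper += count
        # else: no row in the pack can match
    return lower, upper


nodes = [(1, 4, 4), (10, 13, 4), (5, 30, 4), (100, 103, 4)]
print(rough_count_greater(nodes, 12))
```

The width of the returned range shows how much an exact query would still have to resolve by reading data packs.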
  23. Tools for Investigative Analysis
     Fast and Informative:
     • Approximate Queries: Use a combination of the Knowledge Grid and Intelligent Random Sampling to return results very quickly; applicable to any type of query
     • Exact results are not important
     • Top-N type queries
     • Investigative Analytics
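The sampling half of an approximate query can be illustrated with plain uniform random sampling. The slides do not explain what "Intelligent Random Sampling" does, so the sketch below substitutes the simplest possible technique, and `approximate_sum` is a hypothetical name.

```python
import random
from typing import Sequence


def approximate_sum(values: Sequence[float], fraction: float, seed: int = 0) -> float:
    """Estimate SUM(values) from a uniform random sample, scaled to the full size."""
    rng = random.Random(seed)                 # fixed seed keeps the sketch repeatable
    k = max(1, int(len(values) * fraction))   # number of rows actually read
    sample = rng.sample(values, k)
    return sum(sample) * len(values) / k


values = list(range(1, 10_001))
exact = sum(values)
estimate = approximate_sum(values, fraction=0.05)
print(f"exact={exact}, estimate={estimate:.0f}, "
      f"relative error={abs(estimate - exact) / exact:.2%}")
```

Only 5% of the rows are read here, which is the trade the slide describes: a small, bounded loss of accuracy in exchange for far less work.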
  24. Use Case
     • Approximate Query is useful when looking for data in an exploratory fashion (e.g. anomalous events, understanding data characteristics)
     • Example: Find the "Top-10" protocols and ports extracted from event records.
     • Exact Query may take minutes; Approximate Query can answer in seconds. What's important is the Top-10, not necessarily the exact numbers.

     EXACT QUERY
     DY_HR  SUM(TDR)  AP_NAME
     8      14269152  DNS
     8      13716936  HTTP-80
     8      13527636  HTTPS-443
     8      13044432  UNDEFINED
     8      11486904  NO APPL PORT
     8      4280412   UNDEFINED
     8      2313288   HTTP-ALT-8080
     8      1278876   5223
     8      1214100   DNS-53
     8      991560    NO APPL PORT
     8      899220    XMPP-Client

     APPROXIMATE QUERY
     DY_HR  SUM(TDR)  AP_NAME
     8      16872663  HTTP-80
     8      15361320  DNS
     8      14528793  HTTPS-443
     8      13578984  UNDEFINED
     8      11613616  NO APPL PORT
     8      3659742   UNDEFINED
     8      2724149   HTTP-ALT-8080
     8      1427824   5223
     8      1194147   DNS-53
     8      1083973   NO APPL PORT
     8      967579    XMPP-Client
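The Top-N comparison in this use case can be reproduced in miniature. The sketch below is an assumption-laden illustration: the record shape, the function names and the volumes are invented, the application names are borrowed from the tables above, and uniform sampling stands in for whatever Infobright actually does.

```python
import random
from collections import Counter
from typing import List, Tuple

Event = Tuple[str, int]  # hypothetical record shape: (application name, volume)


def top_n(events: List[Event], n: int) -> List[Tuple[str, int]]:
    """Exact answer: aggregate every record, then rank."""
    totals: Counter = Counter()
    for name, volume in events:
        totals[name] += volume
    return totals.most_common(n)


def approximate_top_n(events: List[Event], n: int, fraction: float,
                      seed: int = 0) -> List[Tuple[str, int]]:
    """Approximate answer: aggregate a uniform random sample and scale the sums up."""
    rng = random.Random(seed)
    k = max(1, int(len(events) * fraction))
    scale = len(events) / k
    sample = rng.sample(events, k)
    return [(name, round(total * scale)) for name, total in top_n(sample, n)]


# Synthetic records: names taken from the slide, volumes invented for the sketch.
events = (
    [("HTTP-80", 1)] * 5000
    + [("DNS", 1)] * 3000
    + [("HTTPS-443", 1)] * 2000
    + [("XMPP-Client", 1)] * 200
    + [("DNS-53", 1)] * 100
)

exact = top_n(events, 3)
approx = approximate_top_n(events, 3, fraction=0.1)
print("exact: ", exact)
print("approx:", approx)
```

As in the slide's tables, the estimated sums differ from the exact ones, but the same names make the list, which is usually what an exploratory Top-N question is after.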
  25. Example: Online Advertising Segmentation
     Traditional Queries
     • The goal in this example is to create a targeted campaign. They have a minimum number of participants that have to be included in the target group.
     • Find the top n individuals who meet criteria 1
     • Then find the top m individuals who meet criteria 1 and criteria 2
     • This is repeated until they are in the range that they want to work with, and there can be up to 1500 different criteria, though they normally stop after 7 or 8 different filters
     • They also have to look at how many individuals are in each permutation of the criteria
     • This process can take a considerable amount of time
     Approximate Queries
     • Approximate query could dramatically reduce the amount of time it takes to determine which set of criteria they should use
     • They can (if desired) use exact queries to calculate the exact final numbers, instead of having to do exact queries for all the runs
     • This process can collapse an effort that takes hours into minutes or seconds
  26. Big Data Analytics At the End of the Day: AD HOC PERFORMANCE; SCALABILITY; LOAD SPEEDS; HIGH AVAILABILITY; LOW TOUCH; COMPRESSION; TCO AFFORDABILITY
  27. Thank you!
  28. Perceptions & Questions. Analyst: Robin Bloor
  29. The Current Disposition
     • 10 bn connected devices
     • 13 to 14 bn new processors embedded every year
     • Estimate 31 bn connected devices by 2020
     • Sensors, RFID tags, DSPs, FPGAs, CPUs, etc.
     • To control, alert, log and report
     • Data growth at 55% pa
  30. IOT Data Characteristics
     • Arrives in continuous streams
     • Generally reliable (i.e., not in need of cleansing)
     • Very high volume
     • "Big tables" of predictably structured data
     • So, very little need for ETL activity
     • If "valuable" then processing speed is likely to be critical
  31. IOT Apps and Database
     • Mostly streaming – for alerts and BI (analysis, discovery)
     • DBMS choice is a "horses for courses" thing
     • If performance matters, probably not a Hadoop app
     • The data structure does not favor the prominent NoSQL DBMSs
     • Traditional RDBMS will not do well
     • Hence column-store approach is most logical
  32. The Coming Inversion
     1. Instrument existing (dumb) devices
     2. Gather and analyze data
     3. Redesign device and its instrumentation from knowledge gained
     4. Iterate
  33. Going Forward: In terms of DATA VOLUMES we expect the IOT DATA VOLUME to swamp all other sources of data
  34. Questions
     • Do the high compression rates you achieve occur because it is machine data, i.e., it's a function of the characteristics of the data?
     • Is the "approximate query" an Infobright invention?
     • How frequently do customers use this type of query and for what type of applications?
     • Who, typically, are the Infobright end users?
  35. Questions
     • What "relationship" does Infobright favor with Hadoop?
     • What statistical functions, if any, does Infobright offer?
     • What does the product roadmap look like?
  36. Twitter Tag: #briefr. The Briefing Room
  37. Upcoming Topics. This Month: INNOVATORS. January: ANALYTICS. February: BIG DATA. 2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room and www.insideanalysis.com
  38. Thank You for Your Attention