Self-Service Access and Exploration of Big Data


The Briefing Room with Robin Bloor and Cirro
Live Webcast on Dec. 11, 2012

As the information landscape expands with all kinds of Big Data, businesses are searching for ways to unite their traditional analytics with this new source of insight. One ambitious approach involves federating access to multiple data sources, even across various operating systems. The idea is to take analytic processing to the data, then intelligently assemble the results for a business user. Could this be the long-awaited alternative to data virtualization?

Check out this episode of The Briefing Room to hear veteran analyst Robin Bloor explain how federated access to data sources can pave the way for a truly integrated data fabric. Bloor will be briefed by Mark Theissen of Cirro, who will tout his company's patent-pending Data Hub, which simplifies data access by federating queries across multiple sources of structured, semi-structured, and unstructured data. He'll discuss Cirro's cost-based optimizer, smart caching, dynamic query plan re-optimization, normalization of cost estimates, and a metadata repository for unstructured data sources.


Published in: Technology
Self-Service Access and Exploration of Big Data

  1. The Briefing Room
  2. Welcome. Host: Eric Kavanagh, eric.kavanagh@bloorgroup.com. Twitter Tag: #briefr The Briefing Room
  3. Mission
      • Reveal the essential characteristics of enterprise software, good and bad
      • Provide a forum for detailed analysis of today's innovative technologies
      • Give vendors a chance to explain their product to savvy analysts
      • Allow audience members to pose serious questions... and get answers!
  4. December: Innovators; January: Big Data; February: Analytics; March: Data in Motion
  5. Innovators
      • Charles Babbage conceived the Analytical Engine in 1834.
      • Automation and ease of use have driven innovation in computing ever since.
      • The Cloud and Big Data are raising the bar.
  6. Analyst: Robin Bloor. Robin Bloor is Chief Analyst at The Bloor Group. robin.bloor@bloorgroup.com
  7. Cirro
      • Cirro provides a single method to access any type of data, on any platform, in any environment.
      • Its product suite consists of Cirro Data Hub, Analyst for Excel and Multi Store, all designed to remove complexity from Big Data analytics.
      • Cirro’s products are cloud based and can run in public, private and on-premise environments.
  8. Mark Theissen. Mark is CEO at Cirro. He is a respected analytics and data warehousing expert with more than 22 years in the industry. Most recently Mark was the worldwide data warehousing technical lead at Microsoft following the acquisition of DATAllegro. At DATAllegro Mark was the COO and a member of the board of directors. Prior to joining DATAllegro, Mark was Vice President and Research Lead at META Group (Gartner Group) for Enterprise Analytics Strategies, covering data warehousing, business intelligence and data integration markets. Before META, Mark was VP of Professional Services at Accruent, where he was responsible for domestic and overseas services and operations. Mark has a BS in Computer Information Systems from Chapman University and an MBA from the University of California, Irvine.
  9. Bringing Big Data to the Desktop: Corporate Overview ©2012 Cirro Inc. All rights reserved.
  10. The Big Data Dilemma
  13. Accessing Big Data
  14. Accessing Big Data: Incumbent Approach, Hadoop Approach
  17. What the Market Needs: An enterprise data hub to access any type of data, on any platform, in any environment
  18. The Enterprise Data Hub
  19. Simplifying the Access to Your Data
      • Conventional Approach: People manage the access to data
      • Cirro Approach: Cirro Data Hub manages access to data
      • Diagram labels: HIVE, Hadoop, MapReduce, Install & Config, Hive - Sqoop access tool, Sqoop, Source Control, Java, SQL (multiple versions), Structured - Unstructured, Database Management, Cirro Data Hub, Mashups
  20. Architecture Overview
      • Cirro Data Hub: cost-based federation optimizer; smart caching; dynamic optimization; normalized cost estimates; metadata for unstructured sources
      • Cirro Function Library: library of functions; logic to build complex specific formulas
      • Cirro Analyst: Excel plug-in that allows analysts to explore and process Big Data and traditional data
      • Cirro Multi Store (optional): pre-built structured/unstructured data store; used for holding data or additional workspace
  21. Typical Deployment
      • Excel Analyst Users: design views; minimal IT support; publish views; data exploration; analysis
      • Data Consumers (Tableau, other BI tools): access CDH views via ODBC & JDBC across all data types
      • IT Staff (programmers, developers, DBAs): extend, add proprietary business functions to CFL
      • Cirro Data Hub: Cirro Function Library; proprietary MapReduce objects; custom views
      • Access paths shown: MapReduce, HQL, NoSQL, SQL (RDBMS)
      • Sources shown: Hive, MapReduce and the Hadoop Distributed File System; Splunk, Cassandra, MongoDB; Oracle, Teradata, MySQL, Vertica
  22. Sample Use Case: Summarize the number of tweets per hour with certain keywords from a raw Twitter feed. Requirements:
      • Use raw Twitter data files in Hadoop
      • Keywords stored in SQL table for easy manipulation
      • Results into Tableau or Excel for visualization
  23. Too Many Skills, Coding, Processing
      • Write mapper/reducer in Java using a development tool: parse Twitter text - convert to lower case - parse words - exclude common words - group words by hour
      • Import Java classes into Hadoop
      • Execute command line hadoop using CLI:
        bin/hadoop jar TwitterParse /home/cloudera/WordCount.jar /usr/tweet/input /usr/local/output -libjars
      • Move result into HIVE using JDBC SQL tool:
        create table output1 (text STRING, created_at STRING, count BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
        LOAD DATA INPATH '/usr/data/1-88f1-864e22e77801/part*' OVERWRITE INTO TABLE output1
      • Move SQL table with keywords to HIVE through Sqoop using CLI:
        export --connect jdbc:mySQL:// --password mypasswd --username root --table words --export-dir /home/cloudera/inputfile
        create table mytable (word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
        LOAD DATA INPATH '/home/cloudera/inputfile/part*' OVERWRITE INTO TABLE mytable
      • Run HIVE query using JDBC SQL tool:
        select a.text, a.created_at, a.count from output1 a join mytable b on (a.text = b.word)
      • Import results into Excel using Excel
  24. Too Many Skills, Coding, Processing (the conventional steps of slide 23, overlaid with the equivalent Cirro formulas)
      • B1=TwitterParse("/user/twitter/sample","text,created_at")
      • B2=ToLower(B1,"text")
      • B3=WordSeparate(B2,"text")
      • B4=Exclude(B3,"text")
      • B5=GroupBy(B4,"text,created_at")
      • B6=Cirro_Match(B5,"text","MYSQL.KeyWords","word",C9)
      • Results displayed at cell C9
  25. Bringing Big Data to the Desktop: Corporate Overview
  26. Perceptions & Questions. Analyst: Robin Bloor
  27. Big Data, Hot Data? The Bloor Group
  28. Hadoop & The Big Data Dynamic: Hadoop has become the de facto reservoir for data
  29. Hadoop & The Big Data Dynamic
      • We witnessed something like this a long time ago, with ISAM files, before the advent of RDBMS
      • The difference this time is that Hadoop has an ecosystem and it is growing
      • Big Data (usually caught first by Hadoop) is mostly new data and mostly event data
      • Hadoop is not (yet) a performance engine. It is an all-purpose capability
      • It is delivering business benefits in a big way: it is hot...
  30. BI Categories
      • HINDSIGHT: Regular reporting/operational BI, Excel
      • OVERSIGHT: Dashboards, OLAP, BPM, Excel
      • INSIGHT: Data mining, statistical analysis (trends and relationships)
      • FORESIGHT: Predictive analytics
  31. The New BI Universe (?)
  32. Data Sources (diagram labels: Graph DBMS, XML, Standard DBMS, NoSQL, SQL, Flat files, Hadoop and Metadata, Hadoop Hub?)
  33. Problems Of The Data Layer
      • Hadoop is capable of ETL and often used for ETL, but that usually involves coding of a kind
      • Hadoop is multi-role and hence can spawn multiple instances
      • BI tools, which had good-enough interfaces to RDBMS, don’t link to Hadoop directly, and probably shouldn’t
      • The data layer is more complicated than it was and its complexity is increasing
      • Point to point connectivity usually was, is and may always be a bad idea
      • A connectivity architecture is needed
      • IT REQUIRES SIMPLE CONNECTORS
  34. Questions
      • How would one use the Cirro Multi Store?
      • Which companies/products do you regard as competitors (either directly or close competitors)?
      • How does a Cirro implementation proceed, i.e., where do you start, what are the medium term goals, what do you replace?
      • Conceptually a hub for the data layer is attractive. But how well does it scale out?
  35. Questions
      • Can the hub be physically distributed, i.e., one logical instance with multiple physical instances?
      • How does your proprietary MapReduce differ from Hadoop MapReduce?
      • Is there any aspect of BI that you don’t or can’t cater for (CEP, Data governance, MDM, etc.)?
  36. Twitter Tag: #briefr The Briefing Room
  37. Upcoming Topics: January: Big Data; February: Analytics; March: Data in Motion. 2013 Editorial Calendar: www.insideanalysis.com
  38. Thank You for Your Attention
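
The sample use case on slides 22 to 24 (tweets per hour for a set of keywords) can be illustrated with a small, self-contained Python sketch. It is not Cirro's implementation or a Hadoop job; it only mirrors the logical steps named on the slides (lower-case the text, separate words, exclude common words, group by hour, match against a keyword table). The sample tweets, the `created_at` timestamp format, the keyword set, the common-word list and the function name are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows standing in for the raw Twitter files in Hadoop.
TWEETS = [
    {"text": "Big Data is the new oil", "created_at": "2012-12-11 14:05:00"},
    {"text": "Hadoop makes big data cheap", "created_at": "2012-12-11 14:40:00"},
    {"text": "Excel still rules the desktop", "created_at": "2012-12-11 15:10:00"},
]

# Hypothetical keyword list standing in for the keywords SQL table.
KEYWORDS = {"hadoop", "data"}

# Hypothetical list of common words to exclude.
COMMON_WORDS = {"the", "is", "a", "an", "of", "and"}


def tweets_per_hour_by_keyword(tweets, keywords, common_words):
    """Count, per (hour, keyword), the tweets that mention the keyword."""
    counts = Counter()
    for tweet in tweets:
        # Assumed timestamp format; real feeds may use a different one.
        created = datetime.strptime(tweet["created_at"], "%Y-%m-%d %H:%M:%S")
        hour = created.strftime("%Y-%m-%d %H:00")
        # Lower-case the text and separate it into distinct words.
        words = set(tweet["text"].lower().split())
        # Exclude common words.
        words -= common_words
        # Keep only words present in the keyword table, grouped by hour.
        for word in words & keywords:
            counts[(hour, word)] += 1
    return dict(counts)


if __name__ == "__main__":
    result = tweets_per_hour_by_keyword(TWEETS, KEYWORDS, COMMON_WORDS)
    for (hour, word), n in sorted(result.items()):
        print(hour, word, n)
```

Each tweet is counted at most once per keyword because the words are collected into a set; counting every occurrence instead would only require iterating over the split list directly.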