Integrating Big Data Technologies


Published in: Technology

  1. Integrating Big Data. Dataversity Webinar, Feb 7 2012. ©2012 Sixth Sense Advisors, Inc. All Rights Reserved
  2. State of Data Today
  3. A Growing Trend
     Expectations for BI are changing w/o anyone telling us.
     Requirement: Expectation / Reality
     • Speed: Speed of the Internet / Speed = Infra + Arch + Design
     • Accessibility: Accessibility of a Smartphone / BI Tool licenses & security
     • Usability: iPad - Mobility / Web Enabled BI Tool
     • Availability: Google Search / Data & Report Metadata
     • Delivery: Speed of questions / Methodology & Signoff
     • Data: Access to everything / Structured Data
     • Scalability: Cloud (Amazon) / Existing Infrastructure
     • Cost: Cell phone or Free WIFI / Millions
  4. The Wisdom of Crowds
  5. Data Deluge = Business Insights
  6. BIG Data
     • Structured (current): ERP, CRM, SCM
     • Unstructured (new): Content Management Systems, Email, Call Center, Documents, Contracts
  7. What's so Big about Big Data
     • Velocity
     • Volume
     • Variety
     • Complexity
     • Ambiguity
  8. So you are about to start the Big Data Project
     • Tools
     • Output
     • Data
     • Instructions
  9. The Normal Way Results In ... (Image source: Web)
  10. Why Big Data can Fail on the RDBMS?
     • New demands: new data types, new volume, new analytics, new workload, new metadata
     • Current Data Management Platform (RDBMS + ETL + BI)
     • Results: poor performance, failed programs
     • Scalability; Sharding; ACID
  11. BIG Data
     • Workload Demands
       • Process dynamic data content
       • Process unstructured data
       • Systems that can scale up and scale out with high volume data
       • Perform complex operations within reasonable response time
     • Infrastructure Requirements
       • Scalable platform
       • Database independence
       • Fault tolerant architectures
       • Low cost of acquisition and storage
       • Supported by standard toolsets
  12. Hadoop Design Goals
     • System Shall Manage and Heal Itself
     • Performance Shall Scale Linearly
     • Compute Shall Move to Data
     • Simple Core, Modular and Extensible
  13. Hadoop Differentiators
     • Schema-on-Write: RDBMS
       • Schema must be created before data is loaded.
       • An explicit load operation has to take place which transforms the data to the internal structure of the database.
       • New columns must be added explicitly before data for such columns can be loaded into the database.
       • Read is fast.
       • Standards/Governance.
     • Schema-on-Read: Hadoop
       • Data is simply copied to the file store; no special transformation is needed.
       • A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns.
       • New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
       • Load is fast.
       • Evolving Schemas/Agility.
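The schema-on-read idea can be sketched in a few lines of Python. This is a minimal illustration only, not Hadoop or Hive code: `RAW_STORE`, `read_with_schema`, and the sample records are assumptions made up for the example. Raw lines are stored untouched, and a column list is applied only when the data is read, loosely playing the role of a SerDe.

```python
import csv
import io

# Raw records are stored exactly as they arrived; nothing is parsed at load time.
# (Hypothetical sample data, not from the slides.)
RAW_STORE = [
    "2012-02-07,alice,login",
    "2012-02-07,bob,purchase,19.99",  # a later record carries an extra field
]

def read_with_schema(raw_lines, columns):
    """Apply a column list at read time, in the spirit of a SerDe.

    Missing trailing fields come back as None instead of failing the read.
    """
    rows = []
    for fields in csv.reader(io.StringIO("\n".join(raw_lines))):
        padded = fields + [None] * (len(columns) - len(fields))
        rows.append(dict(zip(columns, padded)))
    return rows

# The first version of the reader knows three columns.
v1 = read_with_schema(RAW_STORE, ["date", "user", "event"])

# Updating the reader exposes the fourth field retroactively, with no reload.
v2 = read_with_schema(RAW_STORE, ["date", "user", "event", "amount"])
```

The stored data never changes between `v1` and `v2`; only the reader does, which is why load is cheap and the schema can evolve.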
  14. Hadoop Known Limitations
     • Write-once model
     • A namespace with an extremely large number of files exceeds the Namenode's capacity to maintain
     • Cannot be mounted by existing OS
       • Getting data in and out is tedious
       • Virtual File System can solve problem
     • HDFS does not implement / support
       • User quotas
       • Access permissions
       • Hard or soft links
       • Data balancing schemes
     • No periodic checkpoints
     • Namenode is single point of failure
       • Automatic restart and failover to another machine not yet supported
  15. Hadoop Tips
     • Hadoop is useful
       • When you must process lots of unstructured data
       • When running batch jobs is acceptable
       • When you have access to lots of cheap hardware
     • Hadoop is not useful
       • For intense calculations with little or no data
       • When your data is not self-contained
       • When you need interactive results
     • Implementation
       • Think big, start small
       • Build on agile cycles
       • Focus on the data, as you will always develop schema on write.
     • Available Optimizations
       • Input to Maps
       • Map only jobs
       • Combiner
       • Compression
       • Speculation
       • Fault Tolerance
       • Buffer Size
       • Parallelism (threads)
       • Partitioner
       • Reporter
       • DistributedCache
       • Task child environment settings
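Two of the optimizations listed above, the combiner and the partitioner, can be illustrated with a small in-memory word count. This is a plain-Python sketch of the MapReduce flow, not the Hadoop API; the function names and the sample input splits are assumptions for illustration.

```python
import zlib
from collections import Counter, defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it is shuffled."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def partition(key, num_reducers):
    """Partitioner: route a key to a reducer using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(splits, num_reducers=2):
    """Run map -> combine -> partition -> reduce over in-memory input splits."""
    reducer_inputs = defaultdict(list)
    shuffled_pairs = 0
    for split in splits:
        mapped = [pair for line in split for pair in map_phase(line)]
        combined = combine(mapped)
        shuffled_pairs += len(combined)
        for key, value in combined:
            reducer_inputs[partition(key, num_reducers)].append((key, value))
    # Reduce: sum the partial counts that arrived at each reducer.
    result = {}
    for pairs in reducer_inputs.values():
        for key, value in pairs:
            result[key] = result.get(key, 0) + value
    return result, shuffled_pairs

# Two hypothetical input splits, one per mapper.
splits = [["big data big data", "big insights"], ["data deluge"]]
counts, shuffled = run_job(splits)
```

In this toy run the mappers emit eight pairs in total, but only five cross the shuffle because each mapper's output is combined first. That reduction in intermediate data is the point of a combiner.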
  16. Hadoop Tips
     • Troubleshooting
       • Are your partitions uniform?
       • Can you combine records at the map side?
       • Are maps reading off a DFS block worth of data?
       • Are you running a single reduce wave (unless the data size per reducer is too big)?
       • Have you tried compressing intermediate data & final data?
       • Are there buffer size issues?
       • Do you see unexplained "long tails"?
       • Are your CPU cores busy?
       • Is at least one system resource being loaded?
     • Performance Tuning
       • Increase the memory/buffer allocated to the tasks
       • Increase the number of tasks that can be run in parallel
       • Increase the number of threads that serve the map outputs
       • Disable unnecessary logging
       • Turn on speculation
       • Run reducers in one wave as they tend to get expensive
       • Tune the usage of DistributedCache; it can increase efficiency
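Two of the troubleshooting questions above reduce to simple arithmetic, sketched here in Python. The helper names, the sample record counts, and the "reduce slots" figure are assumptions for illustration; in practice the numbers would come from job counters and cluster configuration.

```python
import math

def partition_skew(records_per_partition):
    """Return max/mean load; 1.0 means perfectly uniform partitions."""
    mean = sum(records_per_partition) / len(records_per_partition)
    return max(records_per_partition) / mean

def reduce_waves(num_reducers, reduce_slots):
    """Number of sequential waves needed to run all reducers."""
    return math.ceil(num_reducers / reduce_slots)

# Hypothetical record counts per reducer partition: one partition is "hot".
loads = [100, 100, 100, 500]
skew = partition_skew(loads)

# Hypothetical cluster: 40 reducers requested, 16 can run at once.
waves = reduce_waves(num_reducers=40, reduce_slots=16)
```

A skew well above 1.0 suggests the partitioner is sending too many keys to one reducer, which is a common cause of "long tails". More than one wave means some reducers wait for others to finish before they can start.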
  17. NoSQL
     • Stands for Not Only SQL
     • Based on CAP Theorem
     • Usually do not require a fixed table schema nor do they use the concept of joins
     • All NoSQL offerings relax one or more of the ACID properties
     • NoSQL databases come in a variety of flavors
       • XML (myXMLDB, Tamino, Sedna)
       • Wide Column (Cassandra, HBase, Big Table)
       • Key/Value (Redis, Memcached with BerkeleyDB)
       • Graph (neo4j, InfoGrid)
       • Document store (CouchDB, MongoDB)
  18. NoSQL Footprint (chart: Size vs. Complexity)
     • Categories plotted: Key Value, Big Table, Doc Database, Graph
     • Examples shown: Amazon Dynamo, Voldemort, Google Big Table, HBase, Cassandra, Lotus Notes, Graph Theory
  19. NoSQL
     • Access and Query
       • RESTful interfaces (HTTP as an access API)
       • Query languages other than SQL
         • SPARQL - Query language for the Semantic Web
         • Gremlin - the graph traversal language
         • Sones Graph Query Language
       • Data Manipulation / Query API
         • The Google BigTable DataStore API
         • The Neo4j Traversal API
       • Serialization Formats
         • JSON
         • Thrift
         • ProtoBuffers
         • RDF
     • Best Practices
       • Design for data collection
       • Plan the data store
       • Organize by type and semantics
       • Partition for performance
       • Access and Query is run time dependent
       • Horizontal scaling
       • Memory Caching
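Three of the items above (JSON serialization, partitioning, and memory caching) can be combined in one toy key/value store. This is a sketch under stated assumptions, not the API of any NoSQL product: `ShardedStore`, its methods, and the sample document are hypothetical.

```python
import json
import zlib

class ShardedStore:
    """Toy key/value store: JSON-serialized values spread across shards."""

    def __init__(self, num_shards=3):
        self.shards = [{} for _ in range(num_shards)]
        self.cache = {}  # read cache of already-deserialized documents

    def _shard_for(self, key):
        # A stable hash keeps a key on the same shard across runs.
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def put(self, key, document):
        self._shard_for(key)[key] = json.dumps(document)
        self.cache.pop(key, None)  # drop any stale cached copy

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = json.loads(self._shard_for(key)[key])
        return self.cache[key]

store = ShardedStore()
store.put("user:1", {"name": "alice", "tags": ["bi", "hadoop"]})
doc = store.get("user:1")
```

Adding shards spreads keys over more dictionaries, which stands in for horizontal scaling here. A real system also has to rebalance existing keys when the shard count changes, which this sketch does not attempt.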
  20. Textual ETL Engine
     Forest Rim Technology's Textual ETL Engine (TETLE) is an integration tool for turning text into a structure of data that can be analyzed by standard analytical tools.
     • Textual ETL Engine provides a robust user interface to define rules (or patterns / keywords) to process unstructured or semi-structured data.
     • The rules engine encapsulates all the complexity and lets the user define simple phrases and keywords
     • Easy to implement and easy to realize ROI
     • Advantages
       • Simple to use
       • No MR or coding required for text analysis and mining
       • Extensible by Taxonomy integration
       • Works on standard and new databases
       • Produces a highly columnar key-value store, ready for metadata integration
     • Disadvantages
       • Not integrated with Hadoop as a rules interface
       • Currently uses Sqoop for metadata interchange with Hadoop or NoSQL interfaces
       • Current GA does not handle distributed processing outside Windows platform
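The general idea of keyword rules turning free text into key-value rows can be sketched as follows. This is not the Textual ETL Engine or its rule format; the `RULES` mapping, the `extract` function, and the sample note are hypothetical and only illustrate rule-driven extraction.

```python
import re

# Hypothetical rules: each keyword or phrase maps to a taxonomy category.
RULES = {
    "chest pain": "symptom",
    "aspirin": "medication",
    "follow-up": "action",
}

def extract(doc_id, text, rules=RULES):
    """Turn free text into (doc_id, category, phrase, offset) rows."""
    rows = []
    for phrase, category in rules.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            rows.append((doc_id, category, phrase, match.start()))
    # Order rows by where the phrase appears in the document.
    return sorted(rows, key=lambda row: row[3])

note = "Patient reports chest pain. Prescribed aspirin; follow-up in two weeks."
rows = extract("note-1", note)
```

Each row carries the document id, so the output can be joined to structured data on that key. The category column is where a taxonomy would plug in.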
  21. Integration
     • All RDBMS vendors today are supporting Hadoop or NoSQL as an integration or extension
       • Oracle Exalytics / Big Data Appliance
       • Teradata Aster Appliance
       • EMC Greenplum Appliance
       • IBM BigInsights
       • Microsoft Windows Azure Integration
     • There are multiple providers of Hadoop distributions
       • Cloudera
       • Hortonworks
       • Zettaset
     • Adapters from vendors to interface with Cloudera or Hortonworks distributions of Hadoop are available today. There are integration efforts to release Hadoop as an integral engine across the RDBMS vendor platforms
  22. Conceptual Solution Architecture (diagram)
     • Sources: OLTP; Big Data content (email, docs)
     • Integration: ETL, ELT, CDC; Textual ETL and / or MR / Ruby / Java (Hadoop); Taxonomy
     • Stores: Data Warehouse, DataMarts, Big Data DW
     • Shared services: Metadata, MDM
     • Consumption: Reporting, Analytics, Search, OLAP, Text Mining, Content Analytics, Knowledge Analytics
  23. Integration Tips
     • The key to the castle in integrating Big Data is metadata
     • Whatever the tool, technology and technique, if you do not know your metadata, your integration will fail
     • Semantic technologies and architectures will be the way to process and integrate Big Data, much akin to Web 2.0 models
     • Data quality for Big Data is a very questionable goal. To get some semblance of quality, taxonomies and ontologies can be of help
     • 3rd party data providers also provide keywords, trending tags and scores; these can provide a lot of integration support
     • Writing business rules for Big Data can be very cumbersome and not all programs can be written in MapReduce
  24. Which Tool
     Columns: Application / Hadoop / NoSQL / Textual ETL
     • Machine Learning: x x
     • Sentiments: x x x
     • Text Processing: x x x
     • Image Processing: x x
     • Video Analytics: x x
     • Log Parsing: x x x
     • Collaborative Filtering: x x x
     • Context Search: x
     • Email & Content: x
  25. Success Stories
     • Machine learning & Recommendation Engines: Amazon, Orbitz
     • CRM: Consumer Analytics, Metrics, Social Network Analytics, Churn, Sentiment, Influencer, Proximity
     • Finance: Fraud, Compliance
     • Telco: CDR, Fraud
     • Healthcare: Provider / Patient analytics, fraud, proactive care
     • Lifesciences: clinical analytics, physician outreach
     • Pharma: Pharmacovigilance, clinical trials
     • Insurance: fraud, geo-spatial
     • Manufacturing: warranty analytics, supplier quality metrics
  26. Data Science (diagram)
     • Headings: Data Analytics; Art & Science; Applied Science
     • Subjects: Content, Customer, Product, Behaviors
     • Techniques: User Interest Prediction, inventory prediction, Machine learning, Pattern Mining, Optimization, Advanced Regression
     • Layers: Big Data Processing & ETL, Analysis, Business Intelligence, Advanced Analytics
  27. Challenges
     • Resources Availability
     • MR is hard to implement
     • Speech to text
       • Conversation context is often missing
       • Quality of recording
       • Accent issues
     • Visual data tagging
       • Images
       • Text embedded within images
     • Metadata is not available
     • Data is not trusted
     • Content management platform capabilities
     • Ontologies Ambiguity
     • Taxonomy Integration
  28. Contact
     • Krish Krishnan, Twitter: @datagenius