Big Data and Analytics


Published in: Technology

Big Data and Analytics

  1. BIG DATA & ANALYTICS. DAMA Chicago, April 18th 2012. Copyright: Sixth Sense Advisors Inc, 2012.
  2. The Buzz
  3. A Growing Trend: expectations for BI are changing without anyone telling us
      Requirement | Expectations | Reality
      Speed | Speed of the Internet | Speed = Infra + Arch + Design
      Accessibility | Accessibility of a Smartphone | BI Tool licenses & security
      Usability | IPAD - Mobility | Web Enabled BI Tool
      Availability | Google Search | Data & Report Metadata
      Delivery | Speed of questions | Methodology & Signoff
      Data | Access to everything | Structured Data
      Scalability | Cloud (Amazon) | Existing Infrastructure
      Cost | Cell phone or Free WIFI | Millions
  4. Data Disruptions: Porter Competitive Model
  5. State of Data Today
  6. Future of Data
  7. Big Data: Big Data can be defined as data that can grow in volume, velocity, variety and complexity at an unprecedented pace. The growth and complexity present challenges with the capture, storage, management, analysis and visualization using the typical BI tool stack.
  8. Tapping into the data
      Business
      • Structured data used today
      • Big Data existing across the enterprise that can be made available to business
      Infrastructure
      • Today we do Big or Small compute with Small and Large structured data sets
      • Big Data will mean Big or Small compute with Big data sets, not always available in structured or semi-structured formats
  9. Analytics
      • Analytics is the key visualization technique to analyze and monetize from Big Data
      • The field of analytics is resurging from the advent of Big Data
         • Social Analytics
         • Sensor Analytics
         • Text Analytics
         • Deep Data Mining
      • Analytics needs metadata for integration
      • Applications
         • Fraud Detection
         • Campaign Optimization
         • Demand and Supply Optimization
         • Forecast Optimization
  10. Long Tail
      • The Old Way (Pareto Principle, Control or 80/20 rule)
      • The New Way, when Web 2.0 is applied (with a bigger, longer tail)
  11. 2008 US Presidential Elections: $32 million raised from 275,000 people who gave $100 or less
  12. Long Tail Example: Web 2.0 significantly increases total value contributed/received by aggregating the long tail of smaller value donors.
      • High $ value donors: small constellation
      • Low $ value donors: larger constellation (BIG Data)
  13. Brand Management
  14. What's so Big about Big Data: Velocity, Volume, Variety, Complexity, Ambiguity
  15. What do we collect
      • Facebook has an average of 30 billion pieces of content added every month
      • YouTube receives 24 hours of video every minute
      • 5 billion mobile phones in use in 2010
      • A leading retailer in the UK collects 1.5 billion pieces of information to adjust prices and promotions
      • 30% of sales is out of its recommendation engine
      • A Boeing jet engine produces 20 TB/hour for engineers to examine in real time to make improvements
  16. Potential Business Insights
      • Trends
      • Brand Identity & Management
      • Consumer Education
      • Competitive Intelligence
      • Micro-Targeting
      • Leverage "Crowdsourcing" driven innovation to better products and services (DELL, InnoCentive (SAP, P&G))
      • eDiscovery (Legal trends and patterns, financial fraud)
      • Pharmaceutical Companies
         • Patient Education
         • Physician Enriched Content Management
         • Reduce Clinical Trial Cycles and Errors
         • Pharmacovigilance
      • Financial
         • Fraud
         • Customer Management
      • Manufacturing
         • Supply chain optimization
         • Track & Trace
         • Compliance
  17. Why DWBI Fails Repeatedly
      • Lost value = Sum (Latencies) + Opportunity Cost
      • Graph of Business Value against Time: Business Situation, then Data Latency (data is ready), Analysis Latency (information is available), Decision Latency (decision is made); the total is the Action time or Action distance, and the drop in value over that time is the Value Lost
      • Base graph courtesy of Dr. Richard Hackathorn
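The lost-value relation on this slide can be sketched numerically. This is only an illustration under stated assumptions: latencies are measured in hours and priced at a constant, hypothetical `value_per_hour` so that they can be added to an opportunity cost. The function name and all figures are invented for the example and do not come from the talk.

```python
def lost_value(latencies_hours, value_per_hour, opportunity_cost):
    """Lost value = sum of latencies (priced per hour) + opportunity cost."""
    total_latency = sum(latencies_hours.values())
    return total_latency * value_per_hour + opportunity_cost


# Hypothetical latencies for the three stages named on the slide.
latencies = {"data": 24.0, "analysis": 8.0, "decision": 4.0}
total = lost_value(latencies, value_per_hour=100.0, opportunity_cost=500.0)
print(total)
```

Shortening any one of the three latencies reduces the total directly, which is the argument the slide makes for reducing data, analysis and decision latency together.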
  18. The Data Landscape (diagram: several Transactional Systems each feed an ODS; the ODSs feed the Enterprise Datawarehouse through Data Transformation; the warehouse feeds Datamarts & Analytical Databases, which serve Reports, Dashboards, Analytic Models and Other Applications)
  19. ACID Kills
      • Atomic: All of the work in a transaction completes (commit) or none of it completes
      • Consistent: A transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
      • Isolated: The results of any changes made during a transaction are not visible until the transaction has committed.
      • Durable: The results of a committed transaction survive failures
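The atomic and consistent properties listed on this slide can be seen in a small sketch using Python's standard `sqlite3` module. The `accounts` table, the names and the amounts are hypothetical; the point is only that a transfer which violates a constraint leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back too.
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # violates balance >= 0
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits the destination first and only then fails on the debit, so the unchanged balance afterwards shows the rollback rather than a lucky ordering.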
  20. BIG Data Scenarios: EXAMPLES
      To: Bob.Collins@bankwithus.com
      Dear Mr. Collins,
      This email is in reference to my bank account which has been efficiently handled by your bank for more than five years. There has been no problem till date until last week the situation went out of the hand.
      I have deposited one of my high amount cheque to my bank account no: 65656512 which was to be credited same day but due to your staff carelessness it wasn't done and because of this negligence my reputation in the market has been tarnished. Furthermore I had issued one payment cheque to the party which was showing bounced due to "Insufficient balance" just because my cheque didn't make on time.
      My relationship with your bank has matured with the time and it's a shame to tell you about this kind of services are not acceptable when it is question of somebody's reputation. I hope you got my point and I am attaching a copy of the same for further rapid procedures and remit into my account in a day.
      Yours sincerely
      Daniel Carter
      Ph: 564-009-2311
  21. BIG Data Text Example
      • We will often imply additional information in spoken language by the way we place stress on words.
      • The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it (the stressed word is shown in capitals).
         • "I never said she stole my money" (stress on I) - Someone else said it, but I didn't.
         • "I NEVER said she stole my money" - I simply didn't ever say it.
         • "I never SAID she stole my money" - I might have implied it in some way, but I never explicitly said it.
         • "I never said SHE stole my money" - I said someone took it; I didn't say it was she.
         • "I never said she STOLE my money" - I just said she probably borrowed it.
         • "I never said she stole MY money" - I said she stole someone else's money.
         • "I never said she stole my MONEY" - I said she stole something, but not my money.
      • Depending on which word the speaker places the stress, this sentence could have several distinct meanings.
      • Example source: Wikipedia
  22. Pattern Detection
      • Clustering Techniques: K-Means, Maximin, Agglomerative, Divisive, Regression
      • Classification Techniques: Naive Bayes, Neural Networks, Back Propagation, Recursively Splitting, K-Nearest Neighbor, Minimum Distance
      • Reduction Techniques: Backward Elimination, Forward Selection, Attribute Removal, Principal Components
      • Utilities: Accuracy Measures, Range Filters, K-Fold Cross Validation, Merge & Subset, Vector Magnitude
      • Examples
         • Text: OCR, Machine, Digital
         • Face recognition, verification, retrieval
         • Fingerprint recognition
         • Speech recognition
         • Medical diagnosis: X-Ray, EKG analysis
         • Machine diagnostics data
         • Geological data
         • Automated Target Recognition (ATR)
         • Image segmentation and analysis (recognition from aerial or satellite photographs)
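As an illustration of the first clustering technique named on this slide, here is a minimal K-Means sketch (Lloyd's algorithm) on one-dimensional data. It assumes the caller supplies the starting centroids so that the run is deterministic; the function name and the sample points are hypothetical, and a real project would normally use a library implementation.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data with caller-supplied starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

With these inputs the assignments settle after two passes; the fixed iteration count stands in for a proper convergence check.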
  23. So you are about to start the Big Data Project (figure labels: Tools, Output, Data, instructions)
  24. The Normal Way Results In ……..
  25. Performance: Re-Engineering a Ferrari Engine in a Yugo does not make the fastest race car.
      • Current Data Management Platform (RDBMS + ETL Programs + BI)
         • + New Data Types
         • + New volume
         • + New Analytics
         • + New Data Retention
         • + New Data Workloads
      • Result
         • POOR Performance
         • Failed
  26. Big Data and You
      • You need to write data quickly and reliably
         • Incoming data streams are different in type, size, complexity
         • But writing it to disk or memory is not the ultimate goal
      • You need to validate data in real time
      • You need to count and aggregate as you write
      • You need to analyze in real time, because later, even seconds later, is historical
      • You need to scale up and scale out on demand
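The "validate, count and aggregate as you write" requirement on this slide can be sketched as a single-pass writer. This is a toy, in-memory illustration: the `RunningAggregate` class, the `sensor` and `value` field names and the sample events are all assumptions made for the example, not part of the talk.

```python
from collections import defaultdict


class RunningAggregate:
    """Validates each incoming event and updates aggregates in the same pass."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)
        self.rejected = 0

    def write(self, event):
        key = event.get("sensor")
        value = event.get("value")
        # Validation happens at write time, not in a later batch job.
        if not isinstance(key, str) or not isinstance(value, (int, float)):
            self.rejected += 1
            return False
        self.count[key] += 1
        self.total[key] += value
        return True

    def mean(self, key):
        return self.total[key] / self.count[key]


agg = RunningAggregate()
events = [
    {"sensor": "a", "value": 2},
    {"sensor": "a", "value": 4},
    {"sensor": "b"},  # missing value: rejected by validation
    {"sensor": "b", "value": 10},
]
for e in events:
    agg.write(e)
print(agg.count["a"], agg.mean("a"), agg.rejected)
```

Because the counters are updated on every write, a query for the current mean never has to rescan the stored events.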
  27. BIG Data
      • Workload Demands
         • Process dynamic data content
         • Process unstructured data
         • Systems that can scale up and scale out with high volume data
         • Perform complex operations within reasonable response time
      • Infrastructure Needs
         • Scalable platform
         • Database independence
         • Highly fault tolerant architectures
         • Commodity platforms
         • Supported by standard toolsets
  28. Data Warehouse Appliance
      • A Data Warehouse (DW) Appliance is an integrated set of servers, storage, OS, database and interconnect specifically preconfigured and tuned for the rigors of data warehousing.
      • DW appliances offer an attractive price / performance value proposition and are frequently a fraction of the cost of traditional data warehouse solutions.
      • Features
         • High Availability
         • Standard SQL Interface
         • Advanced Compression
         • MPP
         • Leverages existing BI, ETL and OLTP investments
         • Hadoop & MapReduce Interface / Embedded
         • Minimal disk I/O bottleneck; simultaneously load & query
         • Auto Database Management
  29. Hadoop Design Goals
      • System Shall Manage and Heal Itself
      • Performance Shall Scale Linearly
      • Compute Shall Move to Data
      • Simple Core, Modular and Extensible
  30. Hadoop Differentiators
      • Schema-on-Write: RDBMS
         • Schema must be created before data is loaded.
         • An explicit load operation has to take place which transforms the data to the internal structure of the database.
         • New columns must be added explicitly before data for such columns can be loaded into the database.
         • Read is Fast.
         • Standards/Governance.
      • Schema-on-Read: Hadoop
         • Data is simply copied to the file store, no special transformation is needed.
         • A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns.
         • New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them.
         • Load is Fast.
         • Evolving Schemas/Agility.
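The schema-on-read side of this comparison can be sketched in a few lines. This is a toy illustration, not Hadoop or Hive code: a Python list stands in for the file store, JSON parsing stands in for the SerDe, and the `load` and `read` function names and the sample records are hypothetical.

```python
import json

raw_store = []  # stands in for a file store: lines are appended untouched


def load(line):
    """Schema-on-read load: copy the raw line, with no parsing or validation."""
    raw_store.append(line)


def read(columns):
    """Apply the 'SerDe' at read time and project only the requested columns."""
    for line in raw_store:
        record = json.loads(line)
        yield {col: record.get(col) for col in columns}


load('{"user": "ann", "clicks": 3}')
# A new field starts flowing without any schema change at load time.
load('{"user": "bob", "clicks": 5, "country": "US"}')

v1 = list(read(["user", "clicks"]))
v2 = list(read(["user", "country"]))  # older rows simply yield None
print(v1)
print(v2)
```

The load path does no work beyond the copy, which is why load is fast; the parsing cost is paid on every read instead, which is the trade-off the slide contrasts with schema-on-write.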
  31. Hadoop & RDBMS Analogy
      • RDBMS: Sports car
         • refined
         • has a lot of features
         • accelerates very fast
         • pricey
         • expensive to maintain
      • Hadoop: Cargo train
         • rough
         • missing a lot of luxury
         • slow to accelerate
         • carries almost anything
         • moves a lot of stuff very efficiently
      • Original slide author: Amr Awadallah, Cloudera
  32. Hadoop Known Limitations
      • Note: All of these are being addressed by the committers this year and next
      • Write-once model
      • A namespace with an extremely large number of files exceeds Namenode's capacity
      • Cannot be mounted by existing OS
         • Getting data in and out is tedious
         • Virtual File System can solve problem
      • HDFS does not implement / support
         • User quotas
         • Access permissions
         • Hard or soft links
         • Data balancing schemes
      • No periodic checkpoints
  33. Hadoop Tips
      • Hadoop is useful
         • When you must process lots of unstructured data
         • When running batch jobs is acceptable
         • When you have access to lots of cheap hardware
      • Hadoop is not useful
         • For intense calculations with little or no data
         • When your data is not self-contained
         • When you need interactive results
      • Implementation
         • Think big, start small
         • Build on agile cycles
         • Focus on the data, as you will always develop schema on write.
      • Available Optimizations
         • Input to Maps
         • Map only jobs
         • Combiner
         • Compression
         • Speculation
         • Fault Tolerance
         • Buffer Size
         • Parallelism (threads)
         • Partitioner
         • Reporter
         • DistributedCache
         • Task child environment settings
  34. Hadoop Tips
      • Troubleshooting
         • Are your partitions uniform?
         • Can you combine records at the map side?
         • Are maps reading off a DFS block worth of data?
         • Are you running a single reduce wave (unless the data size per reducer is too big)?
         • Have you tried compressing intermediate data & final data?
         • Are there buffer size issues?
         • Do you see unexplained "long tails"?
         • Are your CPU cores busy?
         • Is at least one system resource being loaded?
      • Performance Tuning
         • Increase the memory/buffer allocated to the tasks
         • Increase the number of tasks that can be run in parallel
         • Increase the number of threads that serve the map outputs
         • Disable unnecessary logging
         • Turn on speculation
         • Run reducers in one wave as they tend to get expensive
         • Tune the usage of DistributedCache, it can increase efficiency
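The first troubleshooting question, whether partitions are uniform, can be checked offline with a rough sketch like the one below. It does not use any Hadoop API: it assumes a simple hash-modulo partitioner (built here on `zlib.crc32`) and a hypothetical sample of keys, and reports the ratio of the largest partition to the mean partition size.

```python
import zlib
from collections import Counter


def partition(key, num_partitions):
    # Deterministic hash partitioner; Python's built-in hash() is salted per
    # process for strings, so crc32 is used to get repeatable results.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def skew(keys, num_partitions):
    """Return (records per partition, max/mean ratio); near 1.0 means uniform."""
    counts = Counter(partition(k, num_partitions) for k in keys)
    sizes = [counts.get(p, 0) for p in range(num_partitions)]
    mean = sum(sizes) / num_partitions
    return sizes, max(sizes) / mean


# One "hot" key dominates the sample, so one partition receives most records.
keys = ["hot"] * 90 + [f"user{i}" for i in range(10)]
sizes, ratio = skew(keys, num_partitions=4)
print(sizes, ratio)
```

A high ratio of this kind is one common cause of the unexplained "long tails" also listed above, since the reducer holding the hot key finishes well after the others.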
  35. NoSQL
      • Stands for Not Only SQL
      • Based on CAP Theorem / BASE
      • Usually do not require a fixed table schema nor do they use the concept of joins
      • All NoSQL offerings relax one or more of the ACID properties
      • Scalable replication and distribution
         • Potentially thousands of machines
         • Potentially distributed around the world
      • Queries need to return answers quickly
      • Mostly query, few updates
      • Asynchronous Inserts & Updates
      • NoSQL databases come in a variety of flavors
         • XML (myXMLDB, Tamino, Sedna)
         • Wide Column (Cassandra, HBase, Big Table)
         • Key/Value (Redis, Memcached with BerkeleyDB)
         • Graph (neo4j, InfoGrid)
         • Document store (CouchDB, MongoDB)
  36. NoSQL Footprint (figure: Size plotted against Complexity; labels: Key Value, Big Table, Doc Database, Graph Theory, Graph, Amazon Dynamo, Voldemort, HBase, Google Big Table, Cassandra, Lotus Notes)
  37. NoSQL
      • Access and Query
         • RESTful interfaces (HTTP as an access API)
         • Query languages other than SQL
            • SPARQL - Query language for the Semantic Web
            • Gremlin - the graph traversal language
            • Sones Graph Query Language
         • Data Manipulation / Query API
            • The Google BigTable DataStore API
            • The Neo4j Traversal API
         • Serialization Formats
            • JSON
            • Thrift
            • ProtoBuffers
            • RDF
      • Best Practices
         • Design for data collection
         • Plan the data store
         • Organize by type and semantics
         • Partition for performance
         • Access and Query is run time dependent
         • Horizontal scaling
         • Memory Caching
  38. Map Reduce
      • Technique for indexing and searching large data volumes
      • Two Phases, Map and Reduce
         • Map
            • Extract sets of Key-Value pairs from underlying data
            • Potentially in Parallel on multiple machines
         • Reduce
            • Merge and sort sets of Key-Value pairs
            • Results may be useful for other searches
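The two phases described on this slide can be sketched in plain Python by building a small inverted index, which matches the "indexing and searching" use the slide mentions. This runs in a single process and is not Hadoop code; the `map_phase`, `shuffle` and `reduce_phase` names and the two sample documents are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: extract (key, value) pairs, here (word, document id)."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group values by key; a framework would do this between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce: merge and sort the values collected for one key."""
    return word, sorted(set(doc_ids))


docs = {1: "big data analytics", 2: "big compute small data"}

# Each document could be mapped on a different machine; here it is sequential.
pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
index = dict(reduce_phase(w, ids) for w, ids in sorted(shuffle(pairs).items()))
print(index)
```

The resulting index maps each word to the documents containing it, so a later search for a word is a single lookup rather than a scan of the source data.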
  39. Textual ETL Engine: The Forest Rim Technology Textual ETL Engine (TETLE) is an integration tool for turning text into a structure of data that can be analyzed by standard analytical tools
      • Textual ETL Engine provides a robust user interface to define rules (or patterns / keywords) to process unstructured or semi-structured data.
      • The rules engine encapsulates all the complexity and lets the user define simple phrases and keywords
      • Easy to implement and easy to realize ROI
      • Advantages
         • Simple to use
         • No MR or coding required for text analysis and mining
         • Extensible by Taxonomy integration
         • Works on standard and new databases
         • Produces a highly columnar key-value store, ready for metadata integration
      • Disadvantages
         • Not integrated with Hadoop as a rules interface
         • Currently uses Sqoop for metadata interchange with Hadoop or NoSQL interfaces
         • Current GA does not handle distributed processing outside Windows platform
  40. Integration
      • All RDBMS vendors today are supporting Hadoop or NoSQL as an integration or extension
         • Oracle Exalytics / Big Data Appliance
         • Teradata Aster Appliance
         • EMC Greenplum Appliance
         • IBM BigInsights
         • Microsoft Windows Azure Integration
      • There are multiple providers of Hadoop distributions
         • Cloudera
         • Hortonworks
         • Hadapt
         • Zettaset
         • IBM
      • Adapters from vendors to interface with Cloudera or Hortonworks distributions of Hadoop are available today. There are integration efforts to release Hadoop as an integral engine across the RDBMS vendor platforms
  41. Conceptual Solution Architecture (diagram labels: Metadata, MDM, Taxonomy; OLTP; ETL, ELT, CDC; Data Warehouse, DataMart's; BIG Data Content, Email, Docs; Textual ETL and / or MR / Ruby / Java (Hadoop); Big Data DW)
  42. Which Tool (columns: Application, Hadoop, NoSQL, Textual ETL)
      • Machine Learning: x x
      • Sentiments: x x x
      • Text Processing: x x x
      • Image Processing: x x
      • Video Analytics: x x
      • Log Parsing: x x x
      • Collaborative Filtering: x x x
      • Context Search: x
      • Email & Content: x
  43. Integration Tips
      • The key to the castle in integrating Big Data is metadata
      • Whatever the tool, technology and technique, if you do not know your metadata, your integration will fail
      • Semantic technologies and architectures will be the way to process and integrate the Big Data, much akin to Web 2.0 models
      • Data quality for Big Data is a very questionable goal. To get some semblance of quality, taxonomies and ontologies can be of help
      • 3rd party data providers also provide keywords, trending tags and scores; these can provide a lot of integration support
      • Writing business rules for Big Data can be very cumbersome and not all programs can be written in MapReduce
  44. Success Stories
      • Machine learning & Recommendation Engines: Amazon, Orbitz
      • CRM: Consumer Analytics, Metrics, Social Network Analytics, Churn, Sentiment, Influencer, Proximity
      • Finance: Fraud, Compliance
      • Telco: CDR, Fraud
      • Healthcare: Provider / Patient analytics, fraud, proactive care
      • Lifesciences: clinical analytics, physician outreach
      • Pharma: Pharmacovigilance, clinical trials
      • Insurance: fraud, geo-spatial
      • Manufacturing: warranty analytics, supplier quality metrics
  45. Big Data Challenges
      • Integration to the EDW is still an open issue: Big Data reduces to small metrics, and this translates into the current state issues faced with EDW data
      • Big Data requires a lot of Taxonomy processing, especially in content-related search
      • There are several applications that need high performing memory architectures as data is compute intensive, for example image processing of brain scans
      • Technology is improving by the day, but integration and deployment are becoming equally complex.
  46. Data Science: Art & Science
      • Diagram labels: Data Analytics; APPLIED SCIENCE; Content; User Interest Prediction; Customer inventory prediction; Product; Machine learning; Behaviors; Pattern Mining; Optimization; Advanced Regression; Big Data Processing & ETL; Analysis; Business Intelligence; Advanced Analytics
      • Business Analysts, Data Analysts, Metadata Architects, Data Architects are all in some evolutionary stage of a Data Scientist
  47. Contact: Krish Krishnan, Twitter - @datagenius