Customer summit - big data (final)


Presentation from the Persistent Customer Summit about Big Data

Published in: Technology

  1. 1. BIG DATA Defined: Data Stack 3.0. Persistent Systems, June 2012. 24 July 2012
  2. 2. The Data Revolution is Happening Now
      The growing need for large-volume, multi-structured “Big Data” analytics, as well as … “Fast Data”, have positioned the industry at the cusp of the most radical revolution in database architectures in 20 years. We believe that the economics of data will increasingly drive competitive advantage.
      Source: Credit Suisse Research, Sept 2011
  3. 3. Enterprise Value is Shifting to Data
      [Figure: layers labeled Data, Apps, ERP, Database, Operating Systems, and Mainframe, plotted against a timeline of 1975, 1985, 1995, 2006, 2013]
  4. 4. What Data Can Do For You
      Organizational leaders want analytics to exploit their growing data and computational power to get smart, and get innovative, in ways they never could before.
      Source: MIT Sloan Management Review, The New Intelligent Enterprise. “Big Data, Analytics and the Path From Insights to Value” by Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz, December 21, 2010
  5. 5. Determining Shopping Patterns
      British Grocer Tesco Uses Big Data by Applying Weather Results to Predict Demand and Increase Sales
      Britain often conjures images of unpredictable weather, with downpours sometimes followed by sunshine within the same hour, several times a day. Such randomness has prompted Tesco, the country’s largest grocery chain, to create…its own software that calculates how shopping patterns change “for every degree of temperature and every hour of sunshine.”
      Source: New York Times, September 2, 2009. Tesco, British Grocer, Uses Weather to Predict Sales, by Julia Werdigier. usiness/global/02wea ther.html
  6. 6. Tracking Customers in Social Media
      GlaxoSmithKline Uses Big Data to Efficiently Target Customers
      GlaxoSmithKline is aiming to build direct relationships with 1 million consumers in a year using social media as a base for research and multichannel marketing. Targeted offers and promotions will drive people to particular brand websites where external data is integrated with information already held by the marketing teams.
      Source: Big data: Embracing the elephant in the room, by Steve Hemsley. ta-embracing -the-elepha nt-in-the-room/3030939.article
  7. 7. What does India Think?
      Persistent enables Aamir Khan Productions and Star Plus to use Big Data to know how people react to some of the most excruciating social issues.
      Satyamev Jayate, Aamir Khan’s pioneering, interactive socio-cultural TV show, has caught the interest of the entire nation. It has already generated ~7.5M responses in 4 weeks from viewers across the world over SMS, Facebook, Twitter, phone calls and discussion forums. This data is being analyzed and delivered in real time to allow the producers to understand the pulse of the viewers, to gauge the appreciation for the show and, most importantly, to spread the message. Harnessing the truth from all this data is a key component of the show’s success.
  10. 10. Relational Database Systems for Operational Store
      ● Transaction processing capabilities ideally suited for transaction-oriented operational stores.
      ● Data types – numbers, text, etc.
      ● SQL as the query language
      ● De facto standard as the operational store for ERP and mission-critical systems.
      ● Interface through application programs and query tools
  11. 11. Enterprise Data Warehouse for Decision Support
      ● Operational data stores store on-line transactions – many writes, some reads.
      ● Large fact table, multiple dimension tables
      ● Schema has a specific pattern – star schema
      ● Joins are also very standard and create cubes
      ● Queries focus on aggregates.
      ● Users access data through tools such as Cognos, Business Objects, Hyperion etc.
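As a rough illustration of the star schema pattern named on this slide, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and runs a typical aggregate query over the joins. The table names, columns, and sample rows are invented for illustration and do not come from the presentation.

```python
import sqlite3


def build_star_schema():
    """Create a tiny star schema: one fact table, two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE fact_sales (
            store_id INTEGER REFERENCES dim_store(store_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            amount REAL
        );
        -- Hypothetical sample data
        INSERT INTO dim_store VALUES (1, 'North'), (2, 'South');
        INSERT INTO dim_product VALUES (10, 'Grocery'), (11, 'Clothing');
        INSERT INTO fact_sales VALUES
            (1, 10, 100.0), (1, 11, 50.0), (2, 10, 70.0), (2, 10, 30.0);
        """
    )
    return conn


def sales_by_region_and_category(conn):
    """Join the fact table to its dimensions and aggregate the measure."""
    return conn.execute(
        """
        SELECT s.region, p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_store AS s ON s.store_id = f.store_id
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY s.region, p.category
        ORDER BY s.region, p.category
        """
    ).fetchall()


if __name__ == "__main__":
    for row in sales_by_region_and_category(build_star_schema()):
        print(row)
```

The joins always follow the same shape (fact table to each dimension), which is why warehouse tools can precompute them as cubes.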
  12. 12. Standard Enterprise Data Architecture
      [Diagram labels: ERP Systems, Purchased Data, Legacy Data, Relational Databases, Application Logic, Presentation Layer, Extraction, Cleansing (ETL), Optimized Loader, Data Warehouse Engine, Metadata Repository, Analyze, Query]
      Data Stack 1.0: Operational Data Systems
      Data Stack 2.0: Enterprise Data Warehouse Systems
  13. 13. Despite the two data stacks...
      One in two business executives believe that they do not have sufficient information across their organization to do their job.
      Source: IBM Institute for Business Value
  14. 14. Data has Variety
      Less than 40% of enterprise data is stored in Data Stack 1.0 or Data Stack 2.0.
  15. 15. Beyond the Operational Systems, data required for decision making is scattered within and beyond the enterprise
      [Diagram: data sources including weather forecasts, expense management systems, Twitter feeds, email systems, collaboration/wiki sites, vendor collaboration systems, demographic data, maps, employee surveys, organizational workflow, document repositories, supply chain systems, economic data, ERP systems, customer call center records, social networking data, CRM systems, location and presence data, sensor data, the enterprise data warehouse, and project artifacts, grouped into Structured, Unstructured, Cloud, and Public Data Sources]
  16. 16. Data Volumes are Growing
      5 exabytes of information was created between the dawn of civilization and 2003, but that much information is now created every 2 days, and the pace is increasing.
      Eric Schmidt at the Techonomy Conference, August 4, 2010
      (1 exabyte = 10^18 bytes)
  17. 17. The Continued Explosion of Data in the Enterprise and Beyond
      ● 80% of new information growth is unstructured content; 90% of that is currently unmanaged.
      ● 2009: 800,000 petabytes
      ● 2020: 35 zettabytes
      ● 44x as much data and content over the coming decade
      [Chart: growth curve over a 1990, 2000, 2010, 2020 axis]
      Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
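The 44x figure on this slide can be checked with a few lines of arithmetic. The sketch below converts both volumes to bytes using decimal units and divides them; the compound annual rate is an extra, hypothetical calculation that assumes 11 yearly steps between 2009 and 2020 and is not a number taken from the IDC report.

```python
PETABYTE = 10**15   # bytes, decimal (SI) definition
ZETTABYTE = 10**21  # bytes, decimal (SI) definition


def growth_factor(start_bytes, end_bytes):
    """Return how many times larger end_bytes is than start_bytes."""
    return end_bytes / start_bytes


def annual_growth_rate(factor, years):
    """Return the constant yearly multiplier that compounds to `factor`."""
    return factor ** (1.0 / years)


size_2009 = 800_000 * PETABYTE
size_2020 = 35 * ZETTABYTE

factor = growth_factor(size_2009, size_2020)
yearly = annual_growth_rate(factor, years=11)  # assumption: 2009 to 2020

if __name__ == "__main__":
    print(f"growth factor: {factor:.2f}x")
    print(f"implied yearly multiplier: {yearly:.2f}")
```

The ratio comes out at 43.75, which the slide rounds to 44x; under the stated assumption that corresponds to growth of roughly 40% a year.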
  18. 18. What comes first: structure or data?
      [Diagram: Schema/Structure and Data]
      Structure First is Constraining
  19. 19. Time to create a new data stack for unstructured data. Data Stack 3.0.
  20. 20. The Path to Data Stack 3.0: Must support Variety, Volume and Velocity
      Data Stack 1.0 | Data Stack 2.0 | Data Stack 3.0
      Relational Database Systems | Enterprise Data Warehouse | Dynamic Data Platform
      Recording Business Events | Support for Decision Making | Uncovering Key Insights
      Highly Normalized Data | Un-normalized Dimensional Model | Schema-less Approach
      GBs of Data | TBs of Data | PBs of Data
      End User Access through Ent Apps | End User Access through Reports | End User Direct Access
      Structured | Structured | Structured + Semi-Structured
  21. 21. Can Data Stack 3.0 Address Real Problems?
      ● Large Data Volume at Low Price
      ● Diverse Data beyond Structured Data
      ● Queries that Are Difficult to Answer
      ● Answer Queries that No One Dare Ask
  22. 22. Time-out! Internet companies have already addressed the same problems.
  23. 23. Internet Companies have to deal with large volumes of unstructured real-time data.
      ● Twitter has 140 million active users and more than 400 million tweets per day.
      ● Facebook has over 900 million active users, who generate an average of 3.2 billion Likes and Comments per day.
      ● There were 3.1 billion email accounts in 2011, expected to rise to over 4 billion by 2015.
      ● There were 2.3 billion internet users (2,279,709,629) worldwide in the first quarter of 2012, according to Internet World Stats data updated 31st March 2012.
  24. 24. Their data loads and pricing requirements do not fit traditional relational systems
      ● Hosted service
      ● Large cluster (1000s of nodes) of low-cost commodity servers.
      ● Very large amounts of data: indexing billions of documents, video, images etc.
      ● Batch updates.
      ● Fault tolerance.
      ● Hundreds of millions of users.
      ● Billions of queries every day.
  25. 25. They built their own systems
      ● It is the platform that distinguishes them from everyone else.
      ● They required:
        – high reliability across data centers
        – scalability to thousands of network nodes
        – huge read/write bandwidth
        – support for large blocks of data which are gigabytes in size
        – efficient distribution of operations across nodes to reduce bottlenecks
      Relational databases were not suitable and would have been cost prohibitive.
  26. 26. Internet Companies have open-sourced the source code they created for their own use.
      Companies have created business models to support and enhance this software.
  27. 27. Open Source Rules! Hadoop Infrastructure
  28. 28. What about support?
  29. 29. Enterprises Always had Data. Now there is a way to handle it!
      ● Allows for analysis of massive volumes of information
        • Structured and Unstructured
        • External and Internal
      ● Thousands of users, millions of files, terabytes of data need to be handled
      ● Commoditized hardware can be used to reduce costs
      ● Big Data can and should integrate with existing enterprise information architecture
      Only Big Data makes it possible!
  30. 30. PERSISTENT SYSTEMS AND BIG DATA
  31. 31. Persistent Systems has an experienced team of Big Data experts that has created the technology building blocks to help you implement a Big Data solution that offers a direct path to unlock the value in your data.
  32. 32. Big Data Expertise at Persistent
      ● 10+ projects executed with leading ISVs and enterprise customers
      ● Dedicated group for MapReduce, Hadoop and the Big Data ecosystem (formed 3 years ago)
      ● Engaged with the Big Data ecosystem, including leading ISVs and experts
        • Preferred Big Data Services Partner of IBM and Microsoft
  33. 33. Big Data Leadership and Contributions
      ● Code contributions to Big Data open source projects, including:
        – Hadoop, Hive, and SciDB
      ● Dedicated Hadoop cluster in Persistent
      ● Created PeBAL – Persistent Big Data Analytics Library
      ● Created Visual Programming Environment for Hadoop
      ● Created Data Connectors for Moving Data
      ● Pre-built Solutions to Accelerate Big Data Projects
  34. 34. Persistent’s Big Data Offerings
      1. Setting up and Maintaining Big Data Platform
      2. Data Analytics on Big Data Platform
      3. Building Applications on Big Data
      Technology Assets:
        • Persistent Pre-built Industry Solutions: Retail, Banking, Telco
        • Persistent Pre-built Horizontal Solutions (Email, Text, IT Analytics, …)
        • Persistent Platform Enhancement IP Tools (Visual Programming, PeBAL Analytics Library, Data Connectors)
        • Foundational Infrastructure and Platform (Built Upon Selected 3rd Party Big Data Platforms and Technologies; Cluster of Commodity Hardware)
      People Assets:
        • Big Data Custom Services
        • Extension of Your Team
        • Discovery Workshop
        • Training for Your Team
        • Methodology
        • Team Formation Process
        • Cluster Sizing/Config
  35. 35. Persistent Next Generation Data Architecture
      [Diagram labels]
      Data sources: Email Server, Media Server, Web Proxy, IBM Tivoli, BBCA, Social (Twitter, Facebook), RDBMS, Data Warehouse (DW)
      Ingestion: Connector Framework
      Storage and processing: MapReduce and HDFS, Hive, Pig/JAQL, NoSQL, Text Analytics (GATE/SystemT), Cluster Monitoring
      Persistent Analytics Library (PeBAL): Graph Fn, Set Fn, Text Analytics Fn, …
      Consumption: Solutions, Workflow Integration, Admin App, Reports & Alerts, BI Tools
      Legend: Commercial/Open Source Product, Persistent IP, External Data Source
  36. 36. Persistent Big Data Analytics Library
      WHY PEBAL
        • Lots of common problems – not all of them are solved in MapReduce
        • Pig Latin, Hive and JAQL are languages, not libraries – something is needed to run on top that is not tied to SQL-like interfaces
      FEATURES
        • Organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms.
        • PeBAL functions are schema agnostic.
        • All PeBAL functions are tried and tested against well-defined use cases.
      BENEFITS OF A READY-MADE SOLUTION
        • Proven – well written and tested
        • Reuse across multiple applications
        • Quicker implementation of MapReduce applications
        • High performance
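To make the indexing idea on this slide concrete, here is a minimal sketch of an inverted index built in the map, shuffle, reduce style that Hadoop popularized. It is plain single-process Python, not JAQL and not PeBAL's actual API; every function name and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    for term in sorted(set(text.lower().split())):
        yield term, doc_id


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, doc_ids):
    """Collapse the grouped values into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Run map, shuffle, and reduce in-process over {doc_id: text}."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


if __name__ == "__main__":
    docs = {"d1": "big data platform", "d2": "Big insights"}
    print(build_inverted_index(docs))
```

On a real cluster the map and reduce functions would run on different nodes and the shuffle would move data over the network; the logic per key stays the same.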
  37. 37. [Diagram labels: Web Analytics, Text Analytics, Inverted Lists, Set, Graph, Statistics]
  38. 38. Visual Programming Environment
      ADOPTION BARRIERS
        • Steep learning curve
        • Difficult to code
        • Ad-hoc reporting can’t always be done by writing programs
        • Limited tooling available
      VISUAL PROGRAMMING ENVIRONMENT
        • Use a standard ETL tool as the UI environment for generating Pig scripts
      BENEFITS
        • ETL tools are widely used in enterprises
        • Can leverage a large pool of skilled people who are experts in ETL and BI tools
        • UI helps in iterative and rapid data analysis
        • More people will start using it
  39. 39. Visual Programming Environment for Hadoop
      [Diagram: Data Sources feed an ETL Tool (Data Flow UI, Metadata); a Pig Convertor turns the data flow into Pig code, which uses a Pig UDF Library and runs on the Big Data Platform over HDFS/Hive data. Legend: Persistent IP]
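The conversion step in this diagram, from a data flow drawn in an ETL tool to Pig code, might look roughly like the sketch below. The step dictionary format and the `flow_to_pig` function are invented for illustration and are not Persistent's converter; only the generated statements (LOAD, FILTER, GROUP, FOREACH ... GENERATE, STORE) are standard Pig Latin.

```python
def flow_to_pig(steps):
    """Translate a list of hypothetical data-flow steps into Pig Latin."""
    lines = []
    for step in steps:
        op = step["op"]
        if op == "load":
            lines.append(
                f"{step['out']} = LOAD '{step['path']}' "
                f"USING PigStorage(',') AS ({step['schema']});"
            )
        elif op == "filter":
            lines.append(
                f"{step['out']} = FILTER {step['in']} BY {step['condition']};"
            )
        elif op == "group":
            lines.append(f"{step['out']} = GROUP {step['in']} BY {step['key']};")
        elif op == "count":
            # After GROUP, each row holds `group` plus a bag named after the input.
            lines.append(
                f"{step['out']} = FOREACH {step['in']} "
                f"GENERATE group, COUNT({step['bag']});"
            )
        elif op == "store":
            lines.append(f"STORE {step['in']} INTO '{step['path']}';")
        else:
            raise ValueError(f"unsupported step: {op}")
    return "\n".join(lines)


# Hypothetical flow: count server errors per user.
FLOW = [
    {"op": "load", "out": "logs", "path": "input/logs.csv",
     "schema": "user:chararray, status:int"},
    {"op": "filter", "out": "errors", "in": "logs", "condition": "status >= 500"},
    {"op": "group", "out": "by_user", "in": "errors", "key": "user"},
    {"op": "count", "out": "counts", "in": "by_user", "bag": "errors"},
    {"op": "store", "in": "counts", "path": "output/error_counts"},
]

if __name__ == "__main__":
    print(flow_to_pig(FLOW))
```

Keeping the flow as data rather than code is what lets a drag-and-drop UI drive it: the tool only has to emit the list of steps.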
  40. 40. Persistent Connector Framework
      (Slide badge: 20+ Years)
      WHY CONNECTOR FRAMEWORK
        • Pluggable architecture
      OUT OF THE BOX
        • Database, Data Warehouse
        • Microsoft Exchange
        • Web proxy
        • IBM Tivoli
        • BBCA
        • Generic Push connector for *any* content
      FEATURES
        • Bi-directional connector (as applicable)
        • Supports Push/Pull mechanism
        • Stores data on HDFS in an optimized format
        • Supports masking of data
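The slide does not say how the connectors mask data. One common approach, shown here purely as an assumption, is to replace sensitive field values with a keyed hash before the record is written, so that equal inputs still match each other for analysis while the original value is not stored. The field names, key, and 16-character truncation are all hypothetical.

```python
import hashlib
import hmac


def mask_value(value, key):
    """Return a deterministic, keyed pseudonym for a sensitive string."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; an arbitrary choice


def mask_record(record, sensitive_fields, key):
    """Mask only the named fields, leaving other values untouched."""
    return {
        name: mask_value(str(value), key) if name in sensitive_fields else value
        for name, value in record.items()
    }


if __name__ == "__main__":
    secret = b"example-key"  # hypothetical; manage real keys securely
    row = {"email": "user@example.com", "bytes_sent": 2048}
    print(mask_record(row, {"email"}, secret))
```

Because the mapping is deterministic for a given key, joins and counts on the masked field still work downstream.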
  41. 41. Persistent Data Connectors
  42. 42. Persistent’s Breadth of Big Data Capabilities
      Horizontal and Vertical Pre-built Solutions
      Big Data Platform (PeBAL) analytics libraries and Connectors
      Tooling
        • RDBMS/DWH to import/export data
        • Text Analytics libraries
        • Data visualization using Web 2.0 and reporting tools – Cognos, MicroStrategy
        • Ecosystem tools like Nutch, Katta, Lucene
      IT Management
        • Job configuration, management and monitoring with BigInsights’ job scheduler (MetaTracker)
        • Job failure and recovery management
      Big Data Application Programming
        • Deep JAQL expertise – JAQL programming, extending JAQL using UDFs, integration of third party tools/libraries, performance tuning, ETL using JAQL
        • Expertise in MR programming – Pig, Hive, Java MR
        • Deep expertise in analytics – Text Analytics – IBM’s text extraction solution (AQL + SystemT)
        • Statistical Analytics – R, SPSS, BigInsights integration with R
      Distributed File Systems
        • HDFS
        • IBM GPFS
      Cluster Layer
        • Platform setup on multi-node clusters, monitoring, VM based setup
        • Product deployment
      Legend: Persistent IP for Big Data Solutions; Big Data Platform Components
  43. 43. Persistent Roadmap to Big Data
      1. Learn: Discover and Define Use Cases
      2. Initiate: Validate with a POC
      3. Scale: Upgrade to Production if Successful
      4. Measure: Measure Effectiveness and Business Value
      5. Manage: Improve Knowledge Base and Shared Big Data Platform
  44. 44. Customer Analytics
      Identifying your most influential customers?
      ● Build a social graph of all customers (70 million customers, > 1 billion transactions over twenty years)
      ● Overlay sales data on the graph
      ● Identify influential customers using network analysis (a few thousand influential customers)
      ● Target these customers for promotions
      Targeting influential customers is the best way to improve campaign ROI!
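The slide says only that influential customers are found "using network analysis". As one possible instance, the sketch below ranks nodes in a small directed graph with PageRank, a standard centrality measure swapped in here for illustration; it is not stated to be the method Persistent used. Edge direction (follower, influencer), the customer names, and the parameter values are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank over a directed edge list of (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def top_influencers(edges, k):
    """Return the k highest-ranked nodes, ties broken alphabetically."""
    rank = pagerank(edges)
    return sorted(rank, key=lambda node: (-rank[node], node))[:k]


# Hypothetical graph: an edge (a, b) means customer a is influenced by b.
EDGES = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("dave", "alice"),
    ("alice", "bob"),
]

if __name__ == "__main__":
    print(top_influencers(EDGES, 2))
```

At the scale quoted on the slide (70 million customers), the same iteration would be distributed across a cluster rather than run in one process.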
  45. 45. Overview of Email Analytics
      ● Key Business Needs
        – Ensure compliance with a variety of business and IT communications and information sharing guidelines.
        – Provide an ongoing analysis of customer sentiment through email communications.
      ● Use Cases
        – Quickly identify if there has been an information breach or if information is being shared in ways that are not in compliance with organizational guidelines.
        – Identify if a particular customer is not being appropriately managed.
      ● Benefits
        – Ability to proactively manage email analytics and communications across the organization in a cost-effective way.
        – Reduce the response time to manage a breach and proactively address issues that emerge through ongoing analysis of email.
  46. 46. Using Email to Analyze Customer Sentiment
      Sense the mood of your customers through their emails.
      Carry out detailed analysis on customer team interactions and response times.
  47. 47. Analyzing Prescription Data
      1.5 million patients are harmed by medication errors every year.
      Identifying erroneous prescriptions can save lives!
      Source: Center for Medication Safety & Clinical Improvement
  48. 48. Overview of IT Analytics
      ● Key Business Needs
        – Troubleshooting issues in the world of advanced and cloud-based systems is highly complex, requiring analysis of data from various systems.
        – Information may be in different formats, locations, granularity, data stores.
        – System outages have a negative impact on short-term revenue, as well as long-term credibility and reliability.
        – The ability to quickly identify if a particular system is unstable and take corrective action is imperative.
      ● Use Cases
        – Identify security threats and isolate the corresponding external factors quickly.
        – Identify if an email server is unstable, determine the priority and take preventative action before a complete failure occurs.
      ● Benefits
        – Reduced maintenance cost
        – Higher reliability and SLA compliance
  49. 49. Consumer Insight from Social Media
      Find out what customers are saying about your organization or product in social media.
  50. 50. Insights for Satyamev Jayate – Variety of sources
      Sources shown: Web/TV viewer responses to multiple choice questions, Pledge, Web and Social Media (unstructured), Social Media (structured), emails, IVR/calls, individual blogs, SMS, social widgets, videos
      1. Structured Analysis
        Responses to multiple choice questions
      2. Unstructured Analysis
        Responses to the following questions:
        • Share your story
        • Ask a question to Aamir
        • Send a message of hope
        • Share your solution
        Content Filtering Rating Tagging System (CFRTS) … L0, L1, L2 phased analytics
      3. Impact Analysis
        Crawling the general internet to measure the before and after scenario on a particular topic
  51. 51. Rigorous Weekly Operation Cycle producing instant analytics
      Killer combo of Human+Software to analyze the data efficiently
      ● Topic opens on Sunday episode
      ● Live Analytics report is sent during the show
      ● Data capture from SMS, phone calls, social media, website
      ● System runs L0 analysis; L1, L2 analysts continue
      ● Tags are refined and messages are re-ingested for another pass
      ● JSONs are created for the external and internal dashboards
      ● Featured content is delivered thrice a day throughout the week
  53. 53. Thank you
      Anand Deshpande, Persistent Systems Limited, www.persistentsys.com
  54. 54. Next Generation Sequencing
      ● Sequencing machines are getting affordable
      ● Running cost of sequencing is going down
      ● NGS machines generate TBs of data per week
      ● Need to analyze this data in time
      ● Analysis results are critical for human life, personalized medicines