
From the Big Data keynote at InCSIghts 2012



Presentation that I used for the CSI Annual Event InCSIghts 2012.

Excellent interactive session.



  1. BIG DATA Defined: Data Stack 3.0. Anand Deshpande, Persistent Systems, 8 December 2012
  2. Congratulations to the Pune Chapter: Best Chapter Award at CSI 2012, Kolkata
  3. COMAD 2012, 14-16 December, Pune. Coming to India: Delhi, 2016
  4. The Data Revolution is Happening Now. “The growing need for large-volume, multi-structured ‘Big Data’ analytics, as well as ‘Fast Data’, has positioned the industry at the cusp of the most radical revolution in database architectures in 20 years. We believe that the economics of data will increasingly drive competitive advantage.” Source: Credit Suisse Research, Sept 2011
  5. What Data Can Do For You. Organizational leaders want analytics to exploit their growing data and computational power to get smart, and get innovative, in ways they never could before. Source: MIT Sloan Management Review, The New Intelligent Enterprise: Big Data, Analytics and the Path From Insights to Value, by Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz, December 21, 2010
  6. Determining Shopping Patterns. British grocer Tesco uses Big Data by applying weather results to predict demand and increase sales. Britain often conjures images of unpredictable weather, with downpours sometimes followed by sunshine within the same hour, several times a day. Such randomness has prompted Tesco, the country’s largest grocery chain, to create its own software that calculates how shopping patterns change “for every degree of temperature and every hour of sunshine.” Source: New York Times, September 2, 2009. Tesco, British Grocer, Uses Weather to Predict Sales, by Julia Werdigier
  7. Tracking Customers in Social Media. GlaxoSmithKline uses Big Data to efficiently target customers. GlaxoSmithKline is aiming to build direct relationships with 1 million consumers in a year using social media as a base for research and multichannel marketing. Targeted offers and promotions will drive people to particular brand websites where external data is integrated with information already held by the marketing teams. Source: Big data: Embracing the elephant in the room, by Steve Hemsley
  8. What does India Think? Persistent enabled Aamir Khan Productions and Star Plus to use Big Data to learn how people react to some of the most excruciating social issues. Satyamev Jayate, Aamir Khan’s pioneering, interactive socio-cultural TV show, has caught the interest of the entire nation. It has already generated ~7.5M responses in 4 weeks over SMS, Facebook, Twitter, phone calls and discussion forums from its viewers across the world. This data is being analyzed and delivered in real time to allow the producers to understand the pulse of the viewers, to gauge the appreciation for the show and, most importantly, to spread the message. Harnessing the truth from all this data is a key component of the show’s success.
  9. (image-only slide)
  11. Relational Database Systems for the Operational Store ● Transaction processing capabilities ideally suited for transaction-oriented operational stores ● Data types: numbers, text, etc. ● SQL as the query language ● De-facto standard as the operational store for ERP and mission-critical systems ● Interface through application programs and query tools
  12. Data Stack 1.0: Online Transaction Processing (OLTP) ● High throughput for transactions (writes) ● Focus on reliability: ACID properties ● Highly normalized schema ● Interface through application programs and query tools
  13. Data Stack 2.0: Enterprise Data Warehouse for Decision Support ● Operational data stores hold on-line transactions: many writes, some reads ● Large fact table, multiple dimension tables ● Schema has a specific pattern: star schema ● Joins are also very standard and create cubes ● Queries focus on aggregates ● Users access data through tools such as Cognos, Business Objects, Hyperion, etc.
  14. Data Stack 2.0: Enterprise Data Warehouse (architecture diagram: a staging data store feeds ETL into the data warehouse; an OLAP layer serves reports and ad hoc analysis, alerts and dashboards, what-if analysis, EPM, predictive analytics and data visualization to users)
  15. Standard Enterprise Data Architecture (diagram: Data Stack 1.0, the operational data systems (ERP systems, purchased data, legacy data in relational databases), feeds extraction and cleansing (ETL) through an optimized loader into Data Stack 2.0, the enterprise data warehouse; the data warehouse engine, with its metadata repository, supports analyze/query workloads behind a presentation layer)
  16. Who are the players? ETL: Oracle Data Integrator, Microsoft SQL Server Integration Services (SSIS), IBM InfoSphere DataStage, Business Objects Data Integration, Informatica PowerCenter, Kettle (open source). Data warehouse: Oracle 11g/Exadata, SQL Server Parallel Data Warehouse (PDW), IBM Netezza (PureData), Teradata, Sybase IQ, Greenplum (EMC), Postgres/MySQL (open source). OLAP: Hyperion/Essbase, SQL Server Analysis Services (SSAS), Cognos PowerPlay, SAP HANA, Mondrian OLAP viewer (open source). Reporting: Oracle BI (OBIEE) and Exalytics, SQL Server Reporting Services (SSRS) and Report Builder, Cognos BI Report Studio, Business Objects WebI, BIRT, Pentaho and Jasper (open source), MicroStrategy, QlikTech, Tableau. Predictive analytics: Oracle Data Mining (ODM), SQL Server Data Mining (SSDM), SPSS, SAS Enterprise Miner, SAP HANA + R, R/Weka (open source).
  17. Despite the two data stacks… one in two business executives believe that they do not have sufficient information across their organization to do their job. Source: IBM Institute for Business Value
  18. Data has Variety: it doesn’t fit. Less than 40% of the enterprise data makes its way to Data Stack 1.0 or Data Stack 2.0.
  19. Beyond the operational systems, data required for decision making is scattered within and beyond the enterprise: structured sources (ERP, CRM and supply chain systems, the enterprise data warehouse, expense management, workflow systems, call center records), unstructured sources (email, collaboration systems and wikis, document repositories, employee surveys, project artifacts, sensor data), cloud sources (social networking, Twitter feeds, vendor collaboration, customer location and presence data) and public data sources (demographic data, organizational maps, economic data, weather forecasts).
  20. Data Volumes are Growing. “5 exabytes of information was created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing.” Eric Schmidt at the Techonomy Conference, August 4, 2010. (1 exabyte = 10^18 bytes)
  21. The Continued Explosion of Data in the Enterprise and Beyond. 80% of new information growth is unstructured content; 90% of that is currently unmanaged. 44x as much data and content over the coming decade: 800,000 petabytes in 2009 growing to 35 zettabytes by 2020. Source: IDC, The Digital Universe Decade: Are You Ready?, May 2010
  22. What comes first: structure or data? Starting with schema/structure first is constraining.
  23. Time to create a new data stack for unstructured data: Data Stack 3.0.
  24. Time-out! Internet companies have already addressed the same problems.
  25. Internet companies have to deal with large volumes of unstructured real-time data. ● Twitter has 140 million active users and more than 400 million tweets per day. ● Facebook has over 900 million active users, and an average of 3.2 billion Likes and Comments are generated by Facebook users per day. ● There were 3.1 billion email accounts in 2011, expected to rise to over 4 billion by 2015. ● There were 2.3 billion internet users (2,279,709,629) worldwide in the first quarter of 2012, according to Internet World Stats data updated 31st March 2012.
  26. Their data loads and pricing requirements do not fit traditional relational systems: ● Hosted service ● Large cluster (1000s of nodes) of low-cost commodity servers ● Very large amounts of data: indexing billions of documents, video, images, etc. ● Batch updates ● Fault tolerance ● Hundreds of millions of users ● Billions of queries every day
  27. They built their own systems. It is the platform that distinguishes them from everyone else. They required: high reliability across data centers; scalability to thousands of network nodes; huge read/write bandwidth; support for large blocks of data, gigabytes in size; efficient distribution of operations across nodes to reduce bottlenecks. Relational databases were not suitable and would have been cost-prohibitive.
  28. Internet companies have open-sourced the source code they created for their own use. Companies have created business models to support and enhance this software.
  29. What did the internet companies build? And how did they get there? They started with a clean slate!
  30. What features from the relational database can be compromised? Do we need transaction support? Rigid schemas? Joins? SQL? On-line, live updates? Must haves: scale; the ability to handle unstructured data; the ability to process large volumes of data without having to start with structure first; leverage of distributed computing.
  31. Rethinking ACID properties (Atomicity, Consistency, Isolation, Durability). For the internet workload, with distributed computing, ACID properties are too strong. Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state: BASE (Basic Availability, Soft-state, Eventual consistency).
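The BASE idea can be pictured in a few lines of code: a write lands on one replica, a read from another replica may be stale, and an anti-entropy sync eventually makes the replicas agree. This is a toy single-JVM sketch of eventual consistency, not any particular product's replication protocol; the class and method names are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of eventual consistency: two replicas of a key-value store.
// A write goes to one replica only; reads from the other may be stale
// until an anti-entropy sync copies newer values across.
public class EventualKV {
    final Map<String, String> replicaA = new HashMap<>();
    final Map<String, String> replicaB = new HashMap<>();

    public void writeToA(String key, String value) { replicaA.put(key, value); }

    public String readFromB(String key) { return replicaB.get(key); }

    // Anti-entropy: push A's entries to B (last-write-wins; A assumed newer).
    public void sync() { replicaB.putAll(replicaA); }

    public static void main(String[] args) {
        EventualKV kv = new EventualKV();
        kv.writeToA("user:42", "active");
        System.out.println(kv.readFromB("user:42")); // null: stale read before sync
        kv.sync();
        System.out.println(kv.readFromB("user:42")); // "active": replicas converged
    }
}
```

The point of the sketch is the window between the write and the sync: during that window the system is available but inconsistent, which is exactly the trade BASE accepts.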
  32. Brewer’s CAP Theorem for distributed systems. ● Consistent: reads always pick up the latest write. ● Available: can always read and write. ● Partition tolerant: the system can be split across multiple machines and datacenters. A system can provide at most two of these three (CA, CP or AP).
  33. Essential building blocks for internet data systems (diagram: developers submit Map-Reduce jobs to a job tracker in the Hadoop Map-Reduce layer, which runs on top of the Hadoop Distributed File System (HDFS) across a cluster).
  34. “For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the ‘divide-and-conquer using lots of cheap hardware’ approach to breaking down large problems is the only way to scale, doing so is not easy.” Jeremy Zawodny @ Yahoo!
  35. Challenges with distributed computing. ● Cheap nodes fail, especially if you have many: mean time between failures for 1 node = 3 years; for 1000 nodes = 1 day. Solution: build fault tolerance into the system. ● Commodity network = low bandwidth. Solution: push computation to the data. ● Programming distributed systems is hard. Solution: a data-parallel programming model; users write “map” and “reduce” functions, and the system distributes work and handles faults.
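The failure numbers on this slide are just an expected-value estimate: if one node fails on average every 3 years (about 1,095 days) and failures are independent, a 1,000-node cluster sees a failure roughly every 1,095 / 1,000 ≈ 1.1 days. A sketch of that back-of-the-envelope calculation (class and method names are illustrative):

```java
// Back-of-the-envelope cluster failure rate: with independent failures,
// the expected time between failures across N nodes is (per-node MTBF) / N.
public class ClusterMtbf {
    public static double clusterMtbfDays(double perNodeMtbfDays, int nodes) {
        return perNodeMtbfDays / nodes;
    }

    public static void main(String[] args) {
        double perNode = 3 * 365; // 3 years per node, as on the slide
        System.out.printf("1000-node cluster: one failure every %.2f days%n",
                clusterMtbfDays(perNode, 1000)); // ~1.1 days
    }
}
```

This is why fault tolerance has to live in the software: at cluster scale, node failure is a routine event rather than an exception.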
  36. The Hadoop ecosystem. ● HDFS: distributed, fault-tolerant file system ● MapReduce: framework for writing/executing distributed, fault-tolerant algorithms ● Hive & Pig: SQL-like declarative languages ● Sqoop: package for moving data between HDFS and relational DB systems ● + others (HBase, Avro serialization, ZooKeeper), with ETL, RDBMS and BI reporting tools layered on top
  37. Reliable storage is essential. ● Google GFS, Hadoop HDFS and Kosmix KFS: large distributed log-structured file systems that store all types of data. ● Provide a global file namespace. ● Typical usage pattern: huge files (100s of GB to TB); data is rarely updated in place; reads and appends are common. ● A new application coming on line can use an existing GFS cluster or make its own. ● The file system can be tuned to fit individual application needs.
  38. Distributed file system. ● Chunk servers: a file is split into contiguous chunks, typically 16-64 MB each; each chunk is replicated (usually 2x or 3x), with replicas kept in different racks where possible. ● Master node (a.k.a. Name Node in HDFS): stores metadata; might be replicated.
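The chunking scheme above determines how much raw capacity a file consumes: a file splits into ceil(size / chunk size) chunks, and each chunk is stored once per replica. A minimal sketch of that arithmetic, using 64 MB chunks and 3x replication as example figures from the slide (the class and method names are invented for illustration):

```java
// How many chunks a file splits into, and how many chunk replicas the
// cluster must store, given the chunk size and the replication factor.
public class ChunkMath {
    public static long chunkCount(long fileBytes, long chunkBytes) {
        return (fileBytes + chunkBytes - 1) / chunkBytes; // ceiling division
    }

    public static long replicaCount(long fileBytes, long chunkBytes, int replication) {
        return chunkCount(fileBytes, chunkBytes) * replication;
    }

    public static void main(String[] args) {
        long oneTB = 1L << 40, chunk64MB = 64L << 20;
        System.out.println(chunkCount(oneTB, chunk64MB));      // 16384 chunks
        System.out.println(replicaCount(oneTB, chunk64MB, 3)); // 49152 chunk replicas
    }
}
```

So a 1 TB file at 3x replication occupies 3 TB of raw disk across the cluster, which is the price paid for surviving node and rack failures.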
  39. Now that you have storage, how would you manipulate this data? Why use MapReduce? ● A nice way to partition tasks across lots of machines. ● Handles machine failure. ● Works across different application types, like search and ads. ● You can pre-compute useful data, find word counts, sort TBs of data, etc. ● Computation can automatically move closer to the IO source.
  40. Hadoop is the Apache implementation of MapReduce. ● The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing. ● The Apache Hadoop software library is a framework that allows distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
  41. Hadoop MapReduce Flow (diagram)
  42. Word Count: a distributed solution (diagram). Input lines (“the quick brown fox”, “the fox ate the mouse”, “how now brown cow”) are fed to map tasks, which emit (word, 1) pairs; shuffle & sort groups the pairs by word (e.g. brown → [1, 1]); reduce tasks sum each group to produce the final word counts as output.
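The map / shuffle-and-sort / reduce flow can be simulated in a single JVM: the map phase emits a (word, 1) pair per token, grouping into a sorted map plays the role of the shuffle, and summing each group plays the role of reduce. This is an illustrative sketch of the data flow only, not Hadoop API code; the class name is invented.

```java
import java.util.Map;
import java.util.TreeMap;

// Single-JVM simulation of the word-count data flow: the "map" step emits
// a (word, 1) pair per token, the sorted TreeMap grouping stands in for
// shuffle-and-sort, and the merge function sums counts like "reduce" does.
public class WordCountLocal {
    public static Map<String, Integer> count(String... lines) {
        Map<String, Integer> grouped = new TreeMap<>();
        for (String line : lines) {                    // one "map" call per line
            for (String word : line.split("\\s+")) {
                grouped.merge(word, 1, Integer::sum);  // "reduce": sum the 1s
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, Integer> out = count(
            "the quick brown fox",
            "the fox ate the mouse",
            "how now brown cow");
        System.out.println(out); // e.g. brown=2, fox=2, the=3, and 1 for the rest
    }
}
```

In a real cluster the only difference is placement: map calls run on many machines near the data, and the grouped pairs travel over the network to the reducers.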
  43. Word Count in Map-Reduce (the map and reduce methods of the Hadoop WordCount example; class boilerplate and the word, one and result fields are elided)

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);     // emit (word, 1)
        }
      }

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) { sum += val.get(); }
        result.set(sum);
        context.write(key, result);     // emit (word, total count)
      }
  44. Pig and Hive. ● Pig and Hive provide a wrapper to make it easier to write MapReduce jobs: Pig is a data-flow scripting language; Hive is a SQL-like language. ● The raw data is stored in Hadoop’s HDFS. ● These scripting languages provide ease of programming, optimization opportunities, and extensibility.
  45. Other Hadoop-related projects at Apache include: ● Avro™: a data serialization system. ● Cassandra™: a scalable multi-master database with no single points of failure. ● Chukwa™: a data collection system for managing large distributed systems. ● HBase™: a scalable, distributed database that supports structured data storage for large tables. ● Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying. ● Mahout™: a scalable machine learning and data mining library. ● Pig™: a high-level data-flow language and execution framework for parallel computation. ● ZooKeeper™: a high-performance coordination service for distributed applications.
  46. Powered by Hadoop (more than 100 companies are listed). ● Facebook: 1100-machine cluster with 8800 cores; stores copies of internal log and dimension data sources and uses it as a source for reporting/analytics and machine learning. ● Yahoo: biggest cluster of 4000 nodes; Search Marketing, People You May Know, Search Assist, and many more. ● eBay: 532-node cluster (8 × 532 cores, 5.3 PB); used for search optimization and research.
  47. Hadoop is not a relational database. Hadoop is best suited for batch processing of large volumes of unstructured data: no schemas, no indexes, updates pretty much absent, not designed for joins, no support for integrity constraints, and limited support for data analysis tools. But what are your data analysis needs?
  48. Hadoop is not a relational database: if these are important, stick to an RDBMS: OLTP, data integrity, SQL, data independence, ad-hoc queries, complex relationships, maturity and stability.
  49. Do you need SQL and full relational systems? If not, consider NoSQL databases for your needs: key-value, tabular, document, and graph stores.
  50. 50. The Key-Value In-Memory DBs● In memory DBs are simpler and faster than their on-disk counterparts.● Key value stores offer a simple interface with no schema. Really a giant, distributed hash table.● Often used as caches for on-disk DB systems.● Advantages: – Relatively simple – Practically no server to server talk. – Linear scalability● Disadvantages: – Doesn’t understand data – no server side operations. The key and value are always strings. – It’s really meant to only be a cache – no more, no less. – No recovery, limited elasticity.
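The "giant distributed hash table" is easy to picture: each node is just an in-memory string-to-string map, and the client routes every get/put by hashing the key to one of N nodes. A minimal single-process sketch with the same limitations listed above (no replication, no recovery, no server-side operations; all names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal key-value "cluster": keys and values are plain strings, each node
// is an in-memory hash map, and the client picks a node by hashing the key.
// No replication, no recovery, no server-side operations: just a cache.
public class KvCluster {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    public KvCluster(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) nodes.add(new HashMap<>());
    }

    private Map<String, String> nodeFor(String key) {
        // floorMod keeps the index non-negative even for negative hash codes
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    public void put(String key, String value) { nodeFor(key).put(key, value); }

    public String get(String key) { return nodeFor(key).get(key); }

    public static void main(String[] args) {
        KvCluster cache = new KvCluster(4);
        cache.put("session:9f2", "alice");
        System.out.println(cache.get("session:9f2"));     // alice
        System.out.println(cache.get("session:missing")); // null
    }
}
```

Because routing is a pure function of the key, any client can find any entry without coordination, which is where the linear scalability of these stores comes from.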
  51. Voldemort is a distributed key-value storage system. ● Data is automatically replicated over multiple servers and partitioned so each server contains only a subset of the total data. ● Data items are versioned. ● Server failure is handled transparently. ● Each node is independent of other nodes, with no central point of failure or coordination. ● Pluggable data placement strategies support distribution across data centers that are geographically far apart. ● Good single-node performance: you can expect 10-20k operations per second, depending on the machines, the network, the disk system, and the data replication factor. ● Voldemort is not a relational database: it does not attempt to satisfy arbitrary relations while satisfying ACID properties. Nor is it an object database that attempts to transparently map object reference graphs. Nor does it introduce a new abstraction such as document-orientation. It is basically just a big, distributed, persistent, fault-tolerant hash table.
  52. Tabular stores. ● The original: Google’s BigTable (proprietary, not open source). ● The open-source elephant alternative: Hadoop with HBase. ● A top-level Apache project with a large number of users. ● Contains a distributed file system, MapReduce, a database server (HBase), and more. ● Rack aware.
  53. What is Google’s BigTable? ● BigTable is a large-scale, fault-tolerant, self-managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second. ● BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database; it doesn’t support joins or SQL-type queries. ● It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure. ● Commercial databases simply don’t scale to this level, and they don’t work across 1000s of machines.
  54. Document stores. ● As the name implies, these databases store documents. ● Usually schema-free: the same database can store documents of different shapes. ● Allow indexing based on document content. ● Prominent examples: CouchDB, MongoDB.
  55. Why MongoDB? ● Document-oriented: documents (objects) map nicely to programming-language data types; embedded documents and arrays reduce the need for joins; dynamically typed (schemaless) for easy schema evolution; no joins and no multi-document transactions for high performance and easy scalability. ● High performance: no joins, and embedding makes reads and writes fast; indexes, including indexing of keys from embedded documents and arrays; optional streaming writes (no acknowledgements). ● High availability: replicated servers with automatic master failover. ● Rich query language. ● Easy scalability: automatic sharding (auto-partitioning of data across servers); eventually consistent reads can be distributed over replicated servers.
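A document store can be pictured as a collection of schema-free maps with content-based lookup. A toy sketch of that idea, not the MongoDB API (the class and method names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy document collection: each document is a schema-free map, and queries
// match on field values, the way a document database indexes content.
public class DocCollection {
    private final List<Map<String, Object>> docs = new ArrayList<>();

    public void insert(Map<String, Object> doc) { docs.add(doc); }

    // Find every document whose field equals the given value.
    public List<Map<String, Object>> find(String field, Object value) {
        List<Map<String, Object>> hits = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            if (value.equals(doc.get(field))) hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        DocCollection users = new DocCollection();
        Map<String, Object> a = new HashMap<>();
        a.put("name", "asha"); a.put("city", "Pune");
        Map<String, Object> b = new HashMap<>();
        b.put("name", "ravi"); // a different shape: no "city" field at all
        users.insert(a);
        users.insert(b);
        System.out.println(users.find("city", "Pune").size()); // 1
    }
}
```

Note that the two inserted documents have different fields and the query still works; that is the schema-evolution convenience the slide is pointing at. A real document database adds indexes so `find` does not scan every document.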
  56. Mapping systems to the CAP theorem. Consistency: all clients have the same view of the data. Availability: each client can always read and write. Partition tolerance: the system works well despite physical network partitions. CA: RDBMS (MySQL, Postgres, etc.), AsterData, Greenplum, Vertica. CP: BigTable, Hypertable, HBase, MongoDB, Terrastore, Scalaris, Redis, BerkeleyDB, MemcachedDB. AP: Dynamo, Cassandra, Voldemort, SimpleDB, Tokyo Cabinet, CouchDB, KAI, Riak.
  57. NoSQL use cases: it is important to align the data model to the requirements: bigness (massive data volumes); massive write performance; fast key-value access; flexible schema and flexible data types; schema migration; write availability; no single point of failure; general availability; ease of programming.
  58. 58. Mapping new Internet Data Management Technologies to the Enterprise
  59. 59. Enterprise data strategy is gettinginclusive
  60. Open Source Rules! Hadoop infrastructure
  61. What about support?
  62. The Path to Data Stack 3.0: must support Variety, Volume and Velocity. Data Stack 1.0 (relational database systems): recording business events; highly normalized data; GBs of data; end-user access through enterprise apps; structured data. Data Stack 2.0 (enterprise data warehouse): support for decision making; un-normalized dimensional model; TBs of data; end-user access through reports; structured data. Data Stack 3.0 (dynamic data platform): uncovering key insights; schema-less approach; PBs of data; end-user direct access; structured + semi-structured data.
  63. Can Data Stack 3.0 address real problems? Large data volume at low price; diverse data beyond structured data; queries that are difficult to answer; answers to queries that no one dared ask.
  64. How does one go about the Big Data expedition?
  65. PERSISTENT SYSTEMS AND BIG DATA
  66. Persistent Systems has an experienced team of Big Data experts that has created the technology building blocks to help you implement a Big Data solution that offers a direct path to unlock the value in your data.
  67. Big Data expertise at Persistent. ● 10+ projects executed with leading ISVs and enterprise customers ● Dedicated group for MapReduce, Hadoop and the Big Data ecosystem (formed 3 years ago) ● Engaged with the Big Data ecosystem, including leading ISVs and experts ● Preferred Big Data services partner of IBM and Microsoft
  68. Big Data leadership and contributions. ● Code contributions to Big Data open-source projects, including Hadoop, Hive, and SciDB ● Dedicated Hadoop cluster in Persistent ● Created PeBAL, the Persistent Big Data Analytics Library ● Created a visual programming environment for Hadoop ● Created data connectors for moving data ● Pre-built solutions to accelerate Big Data projects
  69. Persistent’s Big Data offerings: 1. setting up and maintaining a Big Data platform; 2. data analytics on the Big Data platform; 3. building applications on Big Data. Technology assets: pre-built industry solutions (retail banking, telco), pre-built horizontal solutions (email, text, IT analytics, …), a visual programming environment, platform IP tools (PeBAL analytics library, data connectors), and a foundational infrastructure and platform (built upon selected third-party Big Data platforms and technologies on a cluster of commodity hardware). People assets: custom Big Data services, extension of your team, discovery workshops, training for your team, platform enhancement, methodology, team formation, and cluster sizing/configuration.
  70. Persistent next-generation data architecture (diagram): connector frameworks ingest data from email servers, media servers, web proxies, IBM Tivoli, Twitter/Facebook (via a social connector), RDBMSs and data warehouses into a MapReduce and HDFS cluster; on top sit the Persistent Analytics Library (PeBAL: graph, set, text analytics and other function sets), text analytics (GATE/SystemT), Hive, Pig/JAQL, NoSQL stores, workflow integration and cluster monitoring; results feed reports and alerts, BI tools, and admin and solution apps. (A mix of commercial/open-source products, Persistent IP, and external data sources.)
  71. Persistent Big Data Analytics Library (PeBAL). WHY PEBAL: lots of common problems, and not all of them are solved in Map Reduce; PigLatin, Hive and JAQL are languages, not libraries, so something is needed to run on top that is not tied to SQL-like interfaces. FEATURES: organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms; PeBAL functions are schema-agnostic; all PeBAL functions are tried and tested against well-defined use cases. BENEFITS OF A READY-MADE SOLUTION: proven (well written and tested); reuse across multiple applications; quicker implementation of map-reduce applications; high performance.
  72. PeBAL function areas: web analytics, text analytics, inverted lists, set, graph, statistics.
  73. Visual programming environment. ADOPTION BARRIERS: steep learning curve; difficult to code; ad-hoc reporting can’t always be done by writing programs; limited tooling available. VISUAL PROGRAMMING ENVIRONMENT: use a standard ETL tool as the UI environment for generating PIG scripts. BENEFITS: ETL tools are widely used in enterprises; can leverage a large pool of skilled people who are experts in ETL and BI tools; the UI helps in iterative and rapid data analysis; more people will start using it.
  74. Visual programming environment for Hadoop (diagram): an ETL tool’s data-flow UI produces metadata that a PIG converter (Persistent IP) turns into PIG code backed by a PIG UDF library; the generated scripts run against HDFS/Hive data on the Big Data platform.
  75. Persistent connector framework. WHY A CONNECTOR FRAMEWORK: pluggable architecture. OUT OF THE BOX: database, data warehouse; Microsoft Exchange; web proxy; IBM Tivoli; BBCA; generic push connector for *any* content. FEATURES: bi-directional connectors (as applicable); support for push/pull mechanisms; stores data on HDFS in an optimized format; supports masking of data.
  76. Persistent Data Connectors
  77. Persistent’s breadth of Big Data capabilities. Tooling: RDBMS/DWH import/export of data; text analytics libraries; Big Data platform (PeBAL) analytics libraries and connectors; data visualization using Web 2.0 and reporting tools (Cognos, MicroStrategy); ecosystem tools like Nutch, Katta, Lucene. IT management: job configuration, management and monitoring with BigInsights’ job scheduler (MetaTracker); job failure and recovery management. Big Data application programming: deep JAQL expertise (JAQL programming, extending JAQL using UDFs, integration of third-party tools/libraries, performance tuning, ETL using JAQL); expertise in MR programming (PIG, Hive, Java MR); deep expertise in analytics, including text analytics with IBM’s text extraction solution (AQL + SystemT); statistical analytics (R, SPSS, BigInsights integration with R). Cluster layer: HDFS and IBM GPFS distributed file systems; platform setup on multi-node clusters, monitoring, VM-based setup; product deployment. (Persistent IP for Big Data solutions, horizontal and vertical pre-built solutions, and Big Data platform components.)
  78. Persistent roadmap to Big Data: 1. Learn (discover and define use cases); 2. Initiate (validate with a POC); 3. Scale (upgrade to production if successful); 4. Measure (measure effectiveness and business value); 5. Manage (improve the knowledge base and the shared Big Data platform).
  79. Customer analytics: identifying your most influential customers. Build a social graph of all transactions (> 1 billion transactions over twenty years, 70 million customers); identify influential customers using network graph analysis (a few thousand influential customers); overlay sales data on the graph and target these customers for promotions. Targeting influential customers is the best way to improve campaign ROI!
  80. Overview of email analytics. ● Key business needs: ensure compliance with respect to a variety of business and IT communications and information-sharing guidelines; provide an ongoing analysis of customer sentiment through email communications. ● Use cases: quickly identify if there has been an information breach or if information is being shared in ways that are not in compliance with organizational guidelines; identify if a particular customer is not being appropriately managed. ● Benefits: ability to proactively manage email analytics and communications across the organization in a cost-effective way; reduced response time to manage a breach, and proactive handling of issues that emerge through ongoing analysis of email.
  81. Using email to analyze customer sentiment. Sense the mood of your customers through their emails. Carry out detailed analysis of customer-team interactions and response times.
  82. Analyzing prescription data. 1.5 million patients are harmed by medication errors every year. Identifying erroneous prescriptions can save lives! Source: Center for Medication Safety & Clinical Improvement
  83. Overview of IT analytics. ● Key business needs: troubleshooting issues in the world of advanced and cloud-based systems is highly complex, requiring analysis of data from various systems; information may be in different formats, locations, granularities and data stores; system outages have a negative impact on short-term revenue, as well as long-term credibility and reliability; the ability to quickly identify that a particular system is unstable and take corrective action is imperative. ● Use cases: identify security threats and isolate the corresponding external factors quickly; identify if an email server is unstable, determine the priority, and take preventative action before a complete failure occurs. ● Benefits: reduced maintenance cost; higher reliability and SLA compliance.
  84. Consumer insight from social media. Find out what customers are saying about your organization or product in social media.
  85. Insights for Satyamev Jayate: a variety of sources. 1. Structured analysis: responses to the pledge and multiple-choice questions via the web, social media, SMS and IVR. 2. Unstructured analysis: responses to open questions (“share your story”, “ask a question to Aamir”, “send a message of hope”, “share your solution”) via web, emails, IVR/calls, individual blogs, social widgets and videos, passed through a Content Filtering Rating Tagging System (CFRTS) with L0, L1, L2 phased analytics. 3. Impact analysis: crawling the general internet to measure the before & after scenario on a particular topic.
  86. A rigorous weekly operation cycle producing instant analytics: a killer combo of human + software to analyze the data efficiently. The topic opens on Sunday; data is captured from SMS, phone calls, social media and the website all through the week; the system runs L0, L1 and L2 analysis while analysts continue; featured content is delivered three times a day; JSONs are created for the external and internal dashboards; a live analytics report is sent during the show; after the episode, tags are refined and messages are re-ingested for another pass.
  87. (image-only slide)
  88. Thank you. Anand Deshpande, Persistent Systems Limited, www.persistentsys.com
  89. Enterprise value is shifting to data (timeline: mainframe, 1975; operating systems, 1985; database, 1995; ERP apps, 2006; data, 2013).