A technical Introduction to Big Data Analytics


This presentation covers the sources of big data, its business value, what to do with big data, and the platforms, infrastructures and architectures for big data analytics.

Published in: Technology

  1. 1. A Technical Introduction to Big Data Analytics Pethuru Raj PhD Infrastructure Architect IBM Global Cloud Center of Excellence (CoE) IBM India, Bangalore E-mail: peterindia@gmail.com
  2. 2. The Business Intelligence (BI) in the Pre-Big Data Era
  3. 3. The Business Intelligence (BI) in the Post-Big Data Era
  4. 4. The Classification of the IT Trends
• The Technology Space – There is a cornucopia of technologies (computing, connectivity, miniaturization, middleware, sensing, actuation, perception, analyses, knowledge engineering, etc.)
• The Process Space – With new kinds of services, applications, data, infrastructures, and devices joining mainstream IT, fresh process consolidation, orchestration, governance and management mechanisms are emerging. That is, process excellence is the ultimate aim
• The Infrastructure Space – Infrastructure consolidation, convergence, centralization, federation, automation and sharing methods clearly indicate the infrastructure trends in the computing and communication disciplines. Physical infrastructures are turning into virtual infrastructures. The two major infrastructural types are:
  ◦ System Infrastructure (compute, storage & network)
  ◦ Application Infrastructure – integration backbones, platforms (design, development, deployment, delivery, management, etc.), messaging middleware, databases (SQL and NoSQL), etc.
• The Architecture Space – Service-oriented architecture (SOA), event-driven architecture (EDA), model-driven architecture (MDA), resource-oriented architecture (ROA) and so on are the leading architectural patterns
• The Device Space is fast evolving (slim & sleek, handy & trendy, mobile, wearable, implantable, portable, etc.). Everyday machines are tied up with one another as well as to the remote Web / Cloud
• The Data Space – Data are being produced in an automated and massive manner
  5. 5. The Tectonic Trends Towards the Ensuing Knowledge Era
1. Data is being positioned as the strategic asset of any organization
2. Analytics has been an important ingredient for worldwide business enterprises to
  ◦ Strategize and plan ahead
  ◦ Take informed decisions
  ◦ Proceed with confidence and clarity (insights-driven enterprises)
With the arrival of newer technologies, the capabilities and competencies of analytics have been consistently on the climb. In sync with big data, platforms and infrastructures, big insights will become the norm for worldwide organizations
  6. 6. For any Strategic and Sustainable Transformation
 Leverage data assets insightfully
 Optimize infrastructure technologically
 Innovate processes consistently
 Assimilate architectures appropriately
 Choose technologies carefully
 Ensure accessibility, simplicity & consumability cognitively
  7. 7. The Principal Sources for Big Data
  8. 8. The Convergence of Technologies lays a profound foundation for Large-scale Data Generation: Social Media, Cloud Computing, Mobile, Internet of Things
  9. 9. The Extreme Connectivity enables Data Generation in Heaps
  10. 10. The Deeper and Broader Integration pours out Big Data
• Device-to-Device (D2D) Integration
• Device-to-Enterprise (D2E) Integration – In order to have remote and real-time monitoring, management, repair, and maintenance, and for enabling decision-support and expert systems, ground-level heterogeneous devices have to be synchronized with control-level enterprise packages such as ERP, SCM, CRM, KM, etc.
• Device-to-Cloud (D2C) Integration – As most enterprise systems are moving to clouds, device-to-cloud connectivity is gaining importance.
• Cloud-to-Cloud (C2C) Integration – Disparate, distributed and decentralised clouds are getting connected to provide better prospects
  11. 11. The Interconnectivity of Devices generates Large-scale Fast Data
  12. 12. The Technology Cluster Stack
• Physical Devices (Physical World) – Sensors, actuators, controllers, tags, stickers, consumer electronics, appliances, devices, machines, utensils, instruments, gadgets, smart materials
• Device Middleware – Service-oriented device middleware for message routing, enrichment, adaptation, etc.
• Virtual Applications & Platforms (Cyber World) – Applications, services, data sources, packages, platforms, middleware, etc.
• Virtual Infrastructures – Clouds (consolidated, centralized / federated, virtualized, automated and shared infrastructures)
  13. 13. Some Tidbits on the Enormity of Data
  14. 14. The Unequivocal Result: the Data-driven World
 Business transactions, interactions, operations, and analytical data
 System infrastructure log files
 Social & people data
 Customer, product, sales and other business data
 Machine and sensor data
 Scientific experimentation & observation data (genetics, particle physics, climate modeling, drug discovery, etc.)
  15. 15. Why Is Big Data Strategically Significant for Businesses?
  16. 16. Big Data brings in
 Enhanced business value through better performance and productivity
 Bigger and bigger insights through a host of newer analytics and use cases
  17. 17. Big Data: The Business Value
  18. 18. What to Do with Big Data?
  19. 19. Big Data → Big Insights
 Aggregate all kinds of distributed, different and decentralized data
 Analyze the formatted and formalized data
 Articulate the extracted actionable intelligence
 Act based on the insights delivered and raise the bar for futuristic analytics (real-time, predictive, prescriptive and personal analytics)
 Accentuate business performance and productivity
  20. 20. Big Data Analytics: Key Drivers and Applications
  21. 21. The Drivers for Big Data Analysis
1. The exponential growth in data generation, due to the continued increase in diverse and distributed data sources
2. The maturity, stability and convergence of technologies – data virtualization, management, storage, transmission, analysis and visualization techniques, tips, and tools
3. The massive adoption and adaptation of cloud infrastructures (compute, storage and network)
4. The realization of more comprehensive, accurate, and speedier knowledge discovery and dissemination platforms and processes
5. Enhanced business value
6. Newer types of analytics
  ◦ Domain-specific analytics (customer sentiment, social, security, retail, fraud-detection analysis, etc.)
  ◦ Generic analytics (predictive, prescriptive, high-performance, real-time, smarter analytics, etc.)
  22. 22. The Reference Architectures for Big Data Analytics
  23. 23. The Emerging and Evolving Analytics
  24. 24. The Traditional Business Analytics
  25. 25. The Next-Generation Business Analytics
  26. 26. Social Media and Network Analytics
  27. 27. Machine Data Analytics – Use Cases
Here are a few ROI examples from a 1% improvement in productivity across different industries:
 Commercial aviation — a 1% improvement in fuel savings would yield $30 billion over 15 years.
 Utilities — in the global gas-fired power-plant fleet, a 1% improvement could yield a $66 billion savings in fuel consumption.
 Global health care — a 1% efficiency gain from the reduction of process inefficiencies globally could yield more than $63 billion in health-care savings.
 Railway networks — a 1% improvement in moving freight across the world's rail networks could yield another $27 billion in fuel savings.
 Upstream oil and gas — a 1% improvement in capital utilization in upstream oil and gas exploration and development could total $90 billion in avoided or deferred capital expenditures.
The convergence of intelligent devices, intelligent networks and intelligent decisioning (insight vs. hindsight analytics) is definitely paving the foundation for the next growth spurt in productivity gains.
  28. 28. Machine Data Analytics – Use Cases
  29. 29. Machine Data Analytics
  30. 30. Batch Vs. Real-time Analytics
  31. 31. Batch Vs. Real-time Analytics
  32. 32. How Does Real-Time Analytics Work?
  33. 33. The Real-time Analytics Architecture
  34. 34. In-Memory Data Analytics
  35. 35. In-Memory Computing Reference Architecture
  36. 36. Context-Aware Analytics
  37. 37. Big Data Analytics: The Key Platforms
  38. 38. Big Data Analytics: The Platforms
 Analytical, distributed, scalable and parallel databases
 Data warehouses, data marts, etc.
 In-memory systems (SAP HANA, etc.)
 In-database systems (SAS, etc.)
 Distributed file systems (HDFS)
 Hadoop implementations (Cloudera, MapR, Hortonworks, Apache Hadoop, DataStax, etc.)
 NoSQL & hybrid databases
  39. 39. Parallel DBMS
 Standard relational tables and SQL
  ◦ Indexing, compression, caching, I/O sharing
  ◦ Tables partitioned over nodes, transparently to the user
 Meets performance requirements
  ◦ But needs a highly skilled DBA
 Flexible query interfaces
  ◦ UDF support varies across implementations
 Fault tolerance
  ◦ Does not score so well: assumes failures are rare and clusters of only dozens of nodes
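The partition-transparent, scatter-gather query style above can be sketched with in-memory SQLite databases standing in for the nodes of a parallel DBMS; the table, data, and hash-partitioning scheme are purely illustrative, not from any particular product.

```python
import sqlite3

# Each in-memory SQLite database stands in for one node; rows are
# hash-partitioned by key, and the same SQL runs on every partition.
NODES = 4
shards = [sqlite3.connect(":memory:") for _ in range(NODES)]
for db in shards:
    db.execute("CREATE TABLE sales (region TEXT, amount REAL)")

def insert(region, amount):
    # Partitioning is transparent to the caller, as in a real parallel DBMS
    shard = shards[hash(region) % NODES]
    shard.execute("INSERT INTO sales VALUES (?, ?)", (region, amount))

for row in [("east", 10.0), ("west", 5.0), ("east", 2.5), ("north", 7.0)]:
    insert(*row)

# Scatter the aggregate to every node, then gather and combine the partials
partials = [db.execute("SELECT SUM(amount) FROM sales").fetchone()[0] or 0
            for db in shards]
total = sum(partials)
print(total)  # 24.5
```

The same scatter-gather shape underlies aggregate queries on MPP appliances: each node computes a partial result over its partition, and a coordinator merges them.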
  40. 40. MapReduce Programming Model & Hadoop Platforms
 MapReduce is a programming model which specifies:
  ◦ A map function that processes a key/value pair to generate a set of intermediate key/value pairs
  ◦ A reduce function that merges all intermediate values associated with the same intermediate key
 Hadoop comprises large-scale, distributed, elastic, and fault-tolerant data processing and storage modules
  ◦ It is a MapReduce implementation for processing large data sets over thousands of nodes
  ◦ Maps and reduces run independently of each other over blocks of data distributed across a cluster
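A minimal, single-process sketch of the map and reduce functions just described, using word counting as the classic example; a real Hadoop job distributes these phases across nodes, and the function names here are illustrative.

```python
from collections import defaultdict

def map_fn(line):
    # Process one input record, emitting intermediate (key, value) pairs
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Merge all intermediate values associated with the same key
    return (key, sum(values))

def mapreduce(lines):
    groups = defaultdict(list)          # "shuffle": group by intermediate key
    for line in lines:
        for k, v in map_fn(line):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(["big data big insights", "big analytics"])
print(counts["big"])  # 3
```

Because each map call and each reduce call is independent, the framework is free to run them on different machines over different blocks of the input, which is exactly what makes the model scale.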
  41. 41. The Hadoop Architecture
  42. 42. How Does Hadoop Function?
  43. 43. The Hadoop-based Big Data Business Analytics
  44. 44. Why Hadoop?
 Better application-development productivity through a more flexible data model
 Greater ability to scale dynamically to support more users and data
 Improved performance to satisfy the expectations of users wanting highly responsive applications, and to allow more complex processing of data
 Scalability to large data volumes:
  ◦ Scan 100 TB on 1 node @ 50 MB/sec = 23 days
  ◦ Scan on a 1000-node cluster = 33 minutes
 Divide-and-conquer (i.e., data partitioning)
 Cost-efficiency
  ◦ Commodity nodes (cheap, but unreliable)
  ◦ Commodity network
  ◦ Automatic fault tolerance (fewer administrators)
  ◦ Easy to use (fewer programmers)
 Satisfies fault tolerance
 Works in heterogeneous environments
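The scan-time figures above can be verified with a few lines of arithmetic (using decimal units, as the slide does):

```python
# Reproducing the slide's scan-time arithmetic
TB = 10**12
data = 100 * TB            # 100 TB to scan
rate = 50 * 10**6          # 50 MB/sec per node

one_node_days = data / rate / 86400           # seconds in a day
cluster_minutes = data / (rate * 1000) / 60   # 1000-node cluster

print(round(one_node_days, 1))     # 23.1 days
print(round(cluster_minutes, 1))   # 33.3 minutes
```

The thousand-fold speedup is exactly the divide-and-conquer point: the scan parallelizes perfectly because each node reads only its own partition.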
  45. 45. NoSQL Databases
NoSQL encompasses a wide variety of different database technologies that were developed in response to a rise in the volume of data stored about users, objects and products, the frequency with which this data is accessed, and performance and processing needs.
 Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, key-array pairs, or even nested documents.
 Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4j and HyperGraphDB.
 Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key") together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.
 Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together instead of rows.
Examples:
 Cassandra (Facebook) – CQL is the query language
 BigTable (Google)
 Dynamo (Amazon)
 Riak (SoftLayer) – Apache Lucene-based search
 MongoDB
 CouchDB – UNQL is the query language
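The key-value model above, including the Redis-style typed values, can be sketched as a toy store; the class and its methods are invented for illustration and are not the API of any real client library.

```python
# A toy key-value store: every item is an attribute name ("key") plus a
# value, and integer values support an increment operation (cf. Redis INCR).
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def incr(self, key, by=1):
        # Meaningful only for integer-typed values; missing keys start at 0
        self._data[key] = self._data.get(key, 0) + by
        return self._data[key]

kv = KeyValueStore()
kv.put("user:42:name", "Pethuru")
kv.incr("user:42:logins")
kv.incr("user:42:logins")
print(kv.get("user:42:logins"))  # 2
```

The "user:42:..." key convention mimics the flat namespacing commonly used in real key-value stores, where structure lives in the key rather than in a schema.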
  46. 46. Relational Vs. NoSQL Databases
SQL Databases:
 The relational model takes data and separates it into many interrelated tables. Tables reference each other through foreign keys.
 The relational model minimizes the amount of storage space required, because each piece of data is stored in only one place. However, space efficiency comes at the expense of increased complexity when looking up data: the desired information needs to be collected from many tables (often hundreds in today's enterprise applications) and combined before it can be provided to the application. When writing data, the write needs to be coordinated and performed on many tables.
 Developers generally use object-oriented programming languages to build applications, and it is usually most efficient to work with data in the form of an object with a complex structure consisting of nested data, lists, arrays, etc. The relational data model provides a very limited data structure that doesn't map well to the object model; instead, data must be stored in and retrieved from tens or even hundreds of interrelated tables. Object-relational frameworks provide some relief, but the fundamental impedance mismatch still exists between the way an application would like to see its data and the way it's actually stored in a relational database.
NoSQL Databases:
 NoSQL databases have a very different model. For example, a document-oriented NoSQL database takes the data you want to store and aggregates it into documents using the JSON format. Each JSON document can be thought of as an object to be used by your application. A JSON document might, for example, take all the data stored in a row that spans 20 tables of a relational database and aggregate it into a single document/object.
 Aggregating this information may lead to duplication of information, but since storage is no longer cost-prohibitive, the resulting data-model flexibility, the ease of efficiently distributing the resulting documents, and the read and write performance improvements make it an easy trade-off for web-based applications.
 Document databases can store an entire object in a single JSON document and support complex data structures. This makes it easier to conceptualize data as well as write, debug, and evolve applications, often with fewer lines of code.
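The point about aggregating a row that spans many tables into one document can be sketched as follows; the customer, order, and address tables and their fields are hypothetical.

```python
import json

# Data that a relational schema would spread over interrelated tables
customer_row = {"id": 7, "name": "Asha"}
order_rows = [{"cust_id": 7, "item": "router"},
              {"cust_id": 7, "item": "sensor"}]
address_row = {"cust_id": 7, "city": "Bangalore"}

# Aggregated into a single JSON document: one read returns the whole
# object, with no multi-table join, at the cost of some duplication
# when the same data appears in other documents.
doc = {
    "id": customer_row["id"],
    "name": customer_row["name"],
    "city": address_row["city"],
    "orders": [o["item"] for o in order_rows if o["cust_id"] == 7],
}
print(json.dumps(doc))
```

The document maps directly onto the nested object an application would use, which is the impedance-mismatch argument in miniature.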
  47. 47. Relational Vs. NoSQL Databases
SQL Databases:
 Relational technology requires a strict definition of a schema prior to storing any data into a database. Changing the schema once data is inserted is a big deal. Want to start capturing new information not previously considered? Want to make rapid changes to application behavior requiring changes to data formats and content? With relational technology, changes like these are extremely disruptive and frequently avoided.
 RDBMSs support scale-up, implying the fundamentally centralized, shared-everything architecture of relational database technology. Enhancement techniques include sharding, denormalizing, and distributed caching.
NoSQL Databases:
 NoSQL databases, especially document databases, are typically schemaless, allowing you to freely add fields to JSON documents without having to first define changes. The format of the data being inserted can be changed at any time, without application disruption. This allows application developers to move quickly to incorporate new data into their applications.
 NoSQL databases use a cluster of standard, physical or virtual servers to store data and support database operations, with:
  ◦ Auto-sharding
  ◦ Data replication
  ◦ Distributed query support – "sharding" a relational database can reduce, or in certain cases eliminate, the ability to perform complex data queries, whereas NoSQL database systems retain their full query expressive power even when distributed across hundreds of servers
  ◦ Integrated caching – data is transparently cached in system memory. This behavior is transparent to the application developer and the operations team, compared to relational technology, where a caching tier is usually a separate infrastructure tier that must be developed against, deployed on separate servers, and explicitly managed by the ops team.
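The integrated-caching point can be sketched as a read-through cache sitting transparently in front of a slow store; both classes here are hypothetical stand-ins (a dict plays the role of the durable, disk-backed tier).

```python
# Reads go through an in-memory cache transparently, instead of the
# application managing a separate caching tier itself.
class CachingStore:
    def __init__(self, backing):
        self._backing = backing      # slow, durable store (stand-in)
        self._cache = {}             # fast, in-memory tier
        self.hits = 0

    def get(self, key):
        if key in self._cache:       # cache hit: no call to the caller needed
            self.hits += 1
            return self._cache[key]
        value = self._backing[key]   # miss: read through and populate
        self._cache[key] = value
        return value

store = CachingStore({"doc:1": "big data"})
store.get("doc:1")                   # miss, fills the cache
store.get("doc:1")                   # hit, served from memory
print(store.hits)  # 1
```

The caller's code is identical on both calls; only the latency differs, which is what "transparent to the application developer" means in practice.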
  48. 48. The Capability Comparison of Different Analytical Platforms
  49. 49. The Big Data Analytics Infrastructures
  50. 50. Big Data Analytics – The Emerging Infrastructures
 Analytic, scalable, parallel and distributed databases & data warehouses – hardware appliances (MPP and SMP)
 In-memory compute infrastructures (SAP HANA on IBM Power 7)
 In-database compute infrastructures (SAS Teradata, etc.)
 Expertly integrated systems (IBM PureData System for Hadoop, Analytics, etc.)
 Clouds (public, private and hybrid) comprising bare-metal servers and virtual machines (VMs)
  51. 51. In-Memory Data Grid (IMDG)
 An IMDG is a distributed non-relational data or object store. It can be distributed to span more than one server.
 Reading from memory is more than 3,300 times faster than reading from disk. A simple calculation would suggest that if it takes an hour to read a set of information from disk, it would take just over a second to read it from memory.
 This approach brings data to the cloud, where the application can interact with it, and the application is completely shielded from the complexity of having to persist or replicate data back to the on-premise store.
 The use of an IMDG also means that while the data is available in the cloud, it is only available in memory and is never stored on a disk in the cloud.
 IMDGs usually support linear scaling to high loads, data partitioning, redundancy, and automatic data recovery in case of failures.
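A minimal sketch of the IMDG properties listed above — hash partitioning, redundancy via replication, and automatic recovery on failure — using a hypothetical Grid class; real IMDG products expose far richer APIs.

```python
# Objects are hash-partitioned across nodes, each entry is replicated to
# the next node in the ring, and a lookup falls back to the replica when
# the primary node has failed.
class Grid:
    def __init__(self, n_nodes):
        self.nodes = [dict() for _ in range(n_nodes)]
        self.alive = [True] * n_nodes

    def _primary(self, key):
        return hash(key) % len(self.nodes)

    def put(self, key, value):
        p = self._primary(key)
        self.nodes[p][key] = value
        self.nodes[(p + 1) % len(self.nodes)][key] = value  # replica

    def get(self, key):
        p = self._primary(key)
        if not self.alive[p]:                  # automatic failover
            p = (p + 1) % len(self.nodes)
        return self.nodes[p].get(key)

grid = Grid(3)
grid.put("session:9", {"user": "raj"})
grid.alive[grid._primary("session:9")] = False  # simulate node failure
print(grid.get("session:9"))  # {'user': 'raj'}
```

A single replica on the successor node is the simplest redundancy scheme; production grids typically make the replica count configurable and rebalance partitions when nodes join or leave.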
  52. 52. The Big Data Analytics in Clouds
  53. 53. TheTypes of Big Data Analytics in Cloud
  54. 54. Big Data Analytics in Clouds
  55. 55. Why Big Data Analytics in Clouds?
 Agility & affordability – no large capital investment in infrastructure; just use and pay
 Hadoop platforms in clouds – deploying and using any Hadoop platform (generic or specific, open or commercial-grade, etc.) is fast
 NoSQL databases in clouds – NoSQL databases are made available in clouds
 WAN optimization technologies – there are WAN optimization products and platforms for efficiently transmitting data over the Internet infrastructure
 Business applications in clouds – with enterprise information systems (EISs), high-performance computing systems, data storage, and social, device and sensor clouds going up in public clouds, big data analytics at remote, Internet-scale clouds makes sense
 Cloud integrators, brokers & orchestrators – there are products and platforms for seamless interoperability among different and distributed systems, services and data
  56. 56. Entering into the Hybrid World
1. The traditional analytical systems (data warehouses) vs. the big data analytical systems (Hadoop)
2. The traditional databases (RDBMS) vs. the NoSQL databases
3. The scalable, distributed, parallel RDBMSs vs. the NoSQL databases
  57. 57. The Hybrid World
  58. 58. The Data Analytics: the Converged Architecture
  59. 59. Big Data Analytics Solution Architectures for Different Industry Segments
  60. 60. Big Data Insights for Media Industry – A Solution Architecture
  61. 61. Social Network Analytics – A Solution Architecture
  62. 62. Big Data Analytics: the Summary
 Digitalization, service-enablement, extreme connectivity, distribution, commoditization, consumerization, industrialization, etc. are the brewing trends towards big data
 Data volume, variety, velocity and variability are on the rise, signalling heightened data value. This development is due to the diversity and multiplicity of data sources.
 Data capturing, transmission, cleansing, filtering, formatting, and storage tasks, tools, and technologies are maturing fast
 Big data platforms, patterns, practices, products, processes and infrastructures are being developed to streamline big data analytics
  63. 63. The Big Picture Enterprise Space Embedded Space Cloud Space Integration Bus
  64. 64. A Sample List of Book Chapters
  65. 65. Pethuru Raj PhD peterindia@gmail.com www.peterindia.net http://www.linkedin.com/in/peterindia https://www.facebook.com/sweetypeter