Demystifying Big Data by AIBDP.org
 

Usage Rights

© All Rights Reserved

  • Progression of Analytics: 3 minutes; The new phenomenon - Big Data: 4 minutes; Big Data Defined: 3 minutes; 2 minutes; Where is the Technology: 5 minutes; What can we solve with Big Data - example Case Studies: 5 minutes; What is next? Where are the opportunities?: 10 minutes
  • Internal information: known questions and answers.
    • Known structures, structured data types, known volumes, mostly transactional data.
    • Master data is very well defined.
    • Storage: typical data warehouses and data marts using batch processing, traditional ETL, and relational databases.
    • Data growth is incremental, with regular archival.
    • Mostly reporting with a little mining; analysis is mostly descriptive, and predictive analysis is very light.
    • Cross-functional integration of data is very limited and is structured around customers, services and products, logistics, etc.
    • Functional and technical responsibilities are clearly demarcated: mostly data engineers and architects at the back end supporting business analysts and users.
    • Most reports merely measure tactics; they support an existing strategy rather than inducing a new one.
    • Data sizes are in the gigabyte-to-terabyte range; the approach becomes inefficient and costly beyond a certain size.
  • Narrow and focused business missions: not "fit-for-all" but "fit-for-purpose".
    • The need to discover more: facts, relationships, indicators, patterns, trends, and pointers that probably could not be discovered before, by cross-integrating data from various sources.
    • The need to capture and store data, not just collect it.
    • Proliferation of data sources (variety of data): multi-dimensional data, streaming data, geospatial data, social networking data, internal data (RDBMS), video and image data, text data (logs etc.), time series data, genomics.
    • Proliferation of data volume (now petabytes and above): Internet / intranet, social networks (Facebook and Twitter), mobile devices, smart home devices, smart systems (utilities etc.), media and entertainment.
    • The demand for speed (velocity) in how data is collected, understood, processed, and distributed: accessibility (where, when, who, and how), time value (real time or not), increased speeds of consumption, increased speeds of data generation.
    • Demand for high value and accuracy (veracity) of information.
    • Advent of massively parallel processing technology: availability of Hadoop / MapReduce-style open source and packaged technologies.
    • Affordability of infrastructure: commodity servers vs. specialized servers.
    • Hadoop enables a computing solution that is:
      • Scalable: new nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.
      • Cost effective: Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
      • Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
      • Fault tolerant: when you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
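The fault-tolerance point above can be illustrated with a toy model. The sketch below is not Hadoop's actual scheduler or any Hadoop API; the function name `pick_node`, the block names, and the node names are all hypothetical. It only shows the idea that each block of data lives on several nodes, so work can be redirected to a surviving copy when one node is lost.

```python
from typing import Dict, List, Optional, Set


def pick_node(block: str, replicas: Dict[str, List[str]], live: Set[str]) -> Optional[str]:
    """Return the first live node that holds a copy of `block`, or None."""
    for node in replicas.get(block, []):
        if node in live:
            return node
    return None


# Hypothetical cluster: every block is stored on three nodes.
replicas = {
    "block-1": ["node-a", "node-b", "node-c"],
    "block-2": ["node-b", "node-c", "node-d"],
}
live = {"node-a", "node-b", "node-c", "node-d"}

# Placement of work while all nodes are healthy.
before = {block: pick_node(block, replicas, live) for block in replicas}

# Simulate losing node-b: work on block-2 moves to another copy.
live.discard("node-b")
after = {block: pick_node(block, replicas, live) for block in replicas}

print(before)  # {'block-1': 'node-a', 'block-2': 'node-b'}
print(after)   # {'block-1': 'node-a', 'block-2': 'node-c'}
```

In this toy model a block becomes unavailable only when every node holding a copy is down, which is why replication lets processing continue after a single failure.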
  • The word of the hour is "SMART"!
    • Smart business means a targeted value proposition: businesses are under pressure to maximize their investments (a focused approach, not a one-size-fits-all methodology).
    • Targeted value proposition: targeted advertisement, tailored menu, focused initiatives, individualized attention, non-impersonal messaging, efficient governance, greater accuracy.
  • Targeted advertisement, tailored menu, focused initiatives, individualized attention, non-impersonal messaging, efficient governance, greater accuracy.
    • Businesses want to gain competitive advantage by acting on timely, relevant, complete, and accurate information rather than on one-size-fits-all solutions.
    • The immense volume, variety, and velocity of data produced today carries new information, facts, relationships, indicators, and pointers that either could not be practically discovered in the past or simply did not exist before.
  • The market has just started picking up. There are large gaps in vertical solutions. The biggest gap is in Big Data services. Hardware and software components already seem to be available.
  • Adapting to real-time analysis (maybe using HANA)
    • Development of industry standards
    • Development of a universal schema for metadata and cataloging
    • Tools to support security and data governance
    • Support for cloud-ification (multi-tenancy)
    • Support for data lineage
    • Framework for cross-application integration
    • Support for testing
    • Automated and configurable monitoring and management console
    • User interface (UI) frameworks
  • Business Focus
    • Identify data needs for strategic business functions
    • Identify business issues that need to be solved by Big Data
    • Lay out data dependencies between functions
    • Resolve competing priorities
    • Clearly lay out the levels of data and cross-functional requirements
  • Technology Focus
    • Identify the right technology to align with the current landscape for synergies in technology
    • Take stock of existing "technology assets" towards Big Data
    • Assess your current capabilities and architecture to support your goals, and select the deployment strategy that best fits your Big Data questions
    • Identify the resources and minimize "specialties" to exploit synergies with the existing resource pool
    • Lay out a development methodology to streamline delivery
  • Stakeholder Focus
    • Clearly identify the stakeholders at all levels of data consumption
    • Present best practices and align them with the project
    • Plan out the objectives, scope, and timelines
    • Identify the KPIs, reports, dashboards, and predictive and prescriptive analysis to be delivered
  • Process Focus
    • Establish clear data flows from collection of data to consumption of data
    • Identify the data governance execution process: people, processes, mechanisms
    • Design the process to be more business focused than IT focused
    • Clearly establish measures to achieve: accuracy, repeatability, agility, and accountability (reconcilability)

Demystifying Big Data by AIBDP.org: Presentation Transcript

  • De-Mystifying Big Data. Prasad Mavuduri, American Institute of Big Data Professionals
  • RIGHT FOCUS AND ON TARGET
    Agenda
    Analyze & Define
    • Progression of Analytics
    • The new phenomenon - Big Data
    • Big Data Defined
    Technology Discussion
    • Big Data Technology - Hadoop
    • Big Data - Big Savings - Hadoop Use Cases
    • What can we solve with Big Data - example
    • What is next? Where are the opportunities?
  • Progression of Analytics: so far
    • Structured, known data
    • Traditional: ETL, data marts, DW, RDBMS
    • Growth: normal, incremental; archive
    • Less cross-functional integration
    • More tactical than strategic
    • Sizes: GBs to TBs
    • Data architects vs. functional
  • The new phenomenon - Big Data. Growing pains? Big Data? Is it just data?
  • The new phenomenon - Big Data
    1. No to "fit-for-all" but yes to "fit-for-purpose"
    2. Proliferation of data sources (variety of data)
    3. Proliferation of volume of data
    4. The demand for speed (velocity) of data
    5. Demand for high value and accuracy (veracity) of information
    6. Massively parallel processing
    7. Commodity servers vs. specialized servers
    DATA DRIVEN BUSINESS is THE SMART BUSINESS
  • Big Data Definition
    • High volume of data, growing by more than 50% every year
    • High-speed streaming, machine-generated data, etc.
    • Different data sources: in-the-enterprise data and external data around the enterprise
    • Collected data occupies huge storage (typically 100 TB or more), where an RDBMS is inefficient
    Diagram labels: Value, Variety, Volume, Velocity, Veracity; Meaningful
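To put the "more than 50% every year" growth figure in perspective, the short sketch below compounds a data set at that rate. The 10 TB starting size is purely an illustrative assumption, not a figure from the slides; the 100 TB threshold is the one quoted in the definition, and the helper name `years_to_reach` is hypothetical.

```python
import math


def years_to_reach(start_tb: float, target_tb: float, annual_growth: float) -> int:
    """Whole years of compound growth needed for start_tb to reach target_tb."""
    if start_tb >= target_tb:
        return 0
    return math.ceil(math.log(target_tb / start_tb) / math.log(1.0 + annual_growth))


# Assumed starting point: 10 TB today, growing 50% per year.
years = years_to_reach(start_tb=10.0, target_tb=100.0, annual_growth=0.50)

# Time for the data set to double at 50% annual growth.
doubling_time = math.log(2) / math.log(1.5)

print(years)                    # 6
print(round(doubling_time, 2))  # 1.71
```

Under these assumptions the data set doubles roughly every 1.7 years and crosses the 100 TB mark in the sixth year.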
  • Big Data Definition
    Big Data is the new art and science of collecting, storing, processing, distributing, and analyzing data that has any of the attributes high volume, high velocity, or high variety, using massively parallel processing (MPP) technology, to extract high value and greater accuracy (veracity).
    IBM says Big Data means:
    1. Volume (terabytes -> zettabytes)
    2. Variety (structured -> semi-structured -> unstructured)
    3. Velocity (batch -> streaming data)
  • Big Data Technologies - Typical Stack (Big Data Stack)
    • Big Data infrastructure
    • Data manipulation and management
    • Data analysis and mining
    • Predictive and prescriptive analysis
    • Process automation and decision support systems
  • Big Data Technologies - SMAQ Stack (Storage, MapReduce, Query)
    Query: user-friendly analytics
    1. Pig (simple query language)
    2. Hive (similar to SQL)
    3. Cascading (workflow)
    4. Mahout (machine learning)
    5. ZooKeeper (coordination service)
    MapReduce: data distribution and management across nodes in batch mode
    1. Hadoop MapReduce
    2. Alternatives: BashReduce, Disco Project, Spark, GraphLab (Carnegie Mellon), Storm, HPCC (LexisNexis)
    Storage: distributed, non-relational
    1. HBase (columnar DB)
    2. HDFS (Hadoop Distributed File System)
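The MapReduce layer of the stack can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce phases using a word count; it does not use Hadoop's API, and the function names and sample lines are hypothetical. In a real cluster the map and reduce calls would run in parallel on many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input standing in for lines of a log or document.
lines = ["big data big savings", "big data defined"]
counts = word_count(lines)
print(counts)  # {'big': 3, 'data': 2, 'savings': 1, 'defined': 1}
```

Because each map call looks at one line only and each reduce call looks at one key only, the work splits cleanly across machines, which is what makes the model suitable for commodity clusters.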
  • Big Data - Big Savings - Economics
    ROI on the Big Data approach (with Hadoop), TCO per 1 TB:
    • Traditional RDBMS: $37,000
    • Hadoop: only $2,000
    Source: American Institute for Analytics
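The two per-terabyte figures on this slide imply a cost ratio and a savings percentage, worked out below. The sketch assumes that cost scales linearly with volume, which the slide does not state; the function name `tco_comparison` and the 100 TB example size are illustrative.

```python
# Per-terabyte TCO figures quoted on the slide (USD).
RDBMS_TCO_PER_TB = 37_000
HADOOP_TCO_PER_TB = 2_000


def tco_comparison(terabytes: float) -> dict:
    """Compare total cost for a given volume, assuming linear scaling."""
    rdbms = terabytes * RDBMS_TCO_PER_TB
    hadoop = terabytes * HADOOP_TCO_PER_TB
    return {
        "rdbms_usd": rdbms,
        "hadoop_usd": hadoop,
        "savings_usd": rdbms - hadoop,
        "savings_pct": round(100 * (rdbms - hadoop) / rdbms, 1),
        "cost_ratio": rdbms / hadoop,
    }


result = tco_comparison(100)  # illustrative 100 TB data set
print(result["savings_usd"])  # 3500000
print(result["savings_pct"])  # 94.6
print(result["cost_ratio"])   # 18.5
```

On the quoted figures, the RDBMS option costs 18.5 times as much per terabyte, a saving of about 94.6% regardless of the volume chosen.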
  • Where is the market on Big Data (current state)
    • Infrastructure / framework / analytics software
    • Horizontal solutions such as EDW
    • Vertical solutions: healthcare; retail industry; government / public sector; education and human capital; health sciences / genomics; telecommunications / services; energy and utilities; e-commerce / marketing; media and entertainment
    [Chart: Big Data market in $B, 2010 to 2015. Source: IDC 2011]
  • Integrated Big Data Implementation - Architecture: coexistence of Big Data with the existing EDW
    [Diagram labels: web logs, images and videos, social media, documents, structured data; connectors / adapters; Big Data / Hadoop etc.; existing EDW; prescriptive, predictive, reporting, OLAP, modeling]
  • Pure Big Data Implementation - Architecture: pure Big Data
    [Diagram labels: web logs, images and videos, social media, documents, structured data; connectors / adapters; Big Data / Hadoop etc.; prescriptive, predictive, reporting, OLAP, modeling]
    Barriers:
    • Disruption to existing analytics?
    • Roadmap / methodology
    • Certainty of costs
    Hadoop / Bigtable can replace traditional EDWs!
  • Big Data Landscape
  • Big Data Landscape
  • Applied BIG Data
  • BIG Data Opportunities: some gaps and opportunities
    • Real-time analysis (maybe using SAP HANA etc.)
    • User interface (UI) frameworks
    • App development Big Data on Cloud (multi-tenancy)
    • Security and data governance
    • Cross-application integration
    • Industry standards
  • AIBDP - Contribution to Big Data
  • Our Big Data Strategy at a glance
    Business Focus
    • Identify data needs
    • Identify business issues
    • Lay out data dependencies between functions
    • Resolve competing priorities
    • Clearly lay out the levels of data and cross-functional requirements
    Stakeholder Focus
    • Identify the stakeholders
    • Align best practices with the project
    • Plan out the objectives, scope, and timelines
    • Identify the KPIs, reports, dashboards, and predictive and prescriptive analysis to be delivered
    Technology Focus
    • Synergies in current technology
    • Take stock of existing "technology assets" towards Big Data
    • Assess your current capabilities and architecture
    • Identify the resources and minimize "specialties" to exploit synergies with the existing resource pool
    • Lay out a development methodology to streamline delivery
    Process Focus
    • Establish clear data flows
    • Identify the data governance execution process: people, processes, mechanisms
    • Design the process to be more business focused than IT focused
    • Clearly establish measures to achieve: accuracy, repeatability, agility, and accountability (reconcilability)
  • Our Execution Approach - AGILE methodology: an Agile approach to reduce risks
    • Close coordination between the customer and the developer
    • Small incremental steps make testing easier and more manageable and avoid surprises
    • Early recovery from expectation mismatch
    • Clarity of design understanding and regular communication with the user
    • Early warning about risks through regular status reports
    • Full knowledge transfer
  • Thank You! Please contact us with any enquiries: Prasad Mavuduri, prasad@aibdp.org, 408 828 9909. Q & A