Building Big Data Applications

A talk on the frameworks for building big-data applications.

Building Big Data Applications: Presentation Transcript

  • Building Big Data Applications: Services for Private Clouds. Richard McDougall, Chief Architect, Storage and Application Services, VMware, Inc. @richardmcdougll. © 2009 VMware Inc. All rights reserved.
  • Infrastructure, Apps and now Data…
      • [Diagram: Build, Run and Manage across Private and Public clouds. Simplify Infrastructure (With Cloud); Simplify App Platform (Through PaaS); Simplify Data.]
  • Trend 1/3: New Data Growing at 60% Y/Y
      • Exabytes of information stored: 20 Zetta by 2015, 1 Yotta by 2030
      • Yes, you are part of the yotta generation…
      • Sources: audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, logs, scanners, Twitter, CAD/CAM, appliances, machine data, digital movies
      • Source: The Information Explosion, 2009
  • Data Growth in the Enterprise
  • Trend 2/3: Big Data – Driven by Real-World Benefit
  • Enterprise: Early Adopter Industries and Use Cases
  • Early Adopters: Enterprise Segmentation
      • Verticals: Financial Services; Retail; Telco; Manufacturing; Government
      • Targets: Existing Hadoop Users; Business Analysts; Data Scientists; LOB managers; IT/Ops
      • Use Cases: Business Trend Analytics; Revenue analytics; CDR, call pattern analytics; Sensor data analytics; Log, machine data analytics; Fraud detection; Homeland security; Predictive analytics
  • Early Adopters: Non-enterprise Segmentation
      • Verticals: Online Advertising; eCommerce; Mobile; Social Media; Gaming
      • Targets: End users/Exec users; Business Analysts; PM, LOB managers; Marketing/Sales; Data Engineers; Data Scientists; IT/Operations
      • Use Cases: Behavioral Analytics; Audience segmentation; Revenue Optimization; User activity monetization; Inventory, price management; Recommendations; Predictive analytics
  • Why now? More transactions (Social/Mobile/Local)
      • SoMoLo: 500 TB data/day; 30B messages/month; 35 check-ins/sec; 13k API calls/sec
      • Big "traditional" companies: 1 TB data/day; 10k card transactions/sec; 3.7B calls/month
      • [Chart dimensions: size of data, communications, transactions]
  • Trend 3/3: Value from Data Exceeds Hardware Cost
      • Value from the intelligence of data analytics now outstrips the cost of hardware
          • Hadoop enables the use of 10x lower cost hardware
          • Hardware cost halving every 18 months
      • [Chart: Value vs. Cost. Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU]
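The cost argument on this slide can be made concrete with a little arithmetic. The following is a minimal sketch, not something from the talk: the function name `cost_after` and the 36-month horizon are illustrative assumptions, while the $40k and $1k per-CPU figures and the 18-month halving period are taken from the slide.

```python
def cost_after(initial_cost: float, months: float, halving_period: float = 18.0) -> float:
    """Cost per CPU after `months`, assuming cost halves every `halving_period` months."""
    return initial_cost * 0.5 ** (months / halving_period)


BIG_IRON_PER_CPU = 40_000.0   # figure quoted on the slide
COMMODITY_PER_CPU = 1_000.0   # figure quoted on the slide

if __name__ == "__main__":
    ratio = BIG_IRON_PER_CPU / COMMODITY_PER_CPU
    print(f"Big iron costs {ratio:.0f}x more per CPU than a commodity cluster")
    # 36 months is an arbitrary horizon chosen for illustration (two halvings).
    print(f"Commodity CPU after 36 months: ${cost_after(COMMODITY_PER_CPU, 36):.2f}")
```

Under the halving assumption, two halving periods cut the per-CPU cost to a quarter, which is why the value extracted from data can overtake hardware cost quickly.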
  • The Old Big Data Stack
      • [Diagram: Files and SQL Databases feed ETL (Extract, Transform, Load; Informatica) into a Column Oriented Relational Database (Oracle, Teradata, DB2) with Master Data Management (Oracle, SAP). On top: Business Intelligence, Statistics (SAS, SPSS) and Data Visualization (Crystal, Business Objects).]
  • The Old Big Data Stack
      • Unable to handle large data volumes & diversity of data
      • Iterative, brute-force and slow process
      • Lack of ad-hoc data navigation across events and time
      • Cumbersome ETL to "process" and DBAs to "prepare"
      • Focused on structured data that is warehoused
      • Web analytics solutions force real-time events into rigid schemas in DBs
      • [Diagram repeated from the previous slide: Files, SQL Databases, ETL (Informatica), Column Oriented Relational Database (Oracle, Teradata, DB2), Master Data Management (Oracle, SAP), Business Intelligence, Statistics (SAS, SPSS), Data Visualization (Crystal, Business Objects)]
  • The Journey To Big Data Analytics
      • Stage 1: Agile Analytics (Technology Focus). All Data; Faster Answers; Elastic & Scalable. Goal: encourage experimentation with existing data
      • Stage 2: BI As A Service (People & Productivity Focus). Data Science; Collaboration; Self-Service. Goal: discover meaningful insights that impact the business
      • Stage 3: Predictive Enterprise (Application Focus). Real Time Decisions; New Applications; Data Monetization. Goal: operationalize those insights as quickly as possible
      • [Diagram layers: Big Data Enabled Apps; Agile Process & Tools; Analytic Engines; Analytic Productivity Platform; Cloud Infrastructure]
  • Customer profiles
      1. Business analysts, LOB managers, execs
          • Need: out-of-the-box analytics
          • Designed for: self-service for end-user leveraging app developers
      2. Data engineers/analysts
          • Need: out-of-the-box + some customization
          • Designed for: admin + operations
      3. Data scientists
          • Need: power capabilities + heavy customization
          • Designed for: data scientists
      4. IT, Operations
          • Need: out-of-the-box + some customization
          • Designed for: IT/admin, ops
  • What is Data Science and Data Engineering?
      • [Diagram: Data Science & Data Engineering at the intersection of Distributed, Parallelization & Programming Skills; Math and Statistical Algorithm Knowledge; Business Domain and Problem Understanding; Vertical or Horizontal Use Case and Analytics Experience]
  • What is Driving Big Data?
      • [Chart categories: Structured; Semi-structured; Largely Unstructured]
      • Source: IBM and Oxford Survey: Getting Closer to Customers Tops Big Data Agenda, October 17, 2012
  • Today’s Big Data System
      • [Diagram labels: Real Time Streams; Real-Time Processing (S4, Storm); Analytics; ETL; Real Time Structured Database; Parallel Big SQL; Batch Data Processing; Unstructured Data (HDFS)]
  • The Unified Analytics Cloud Platform
      • [Diagram layers, top to bottom]
          • Analytics Tools: Madlib, Karmasphere, Datameer, Tableau, Hadoop, R
          • Developer Frameworks / PaaS: Spring, Python, Cloudfoundry
          • Database/DataStore: Cassandra, HBase, HDFS, HawQ, Impala
          • Data Platform / Data PaaS: Data-Director, EMC Chorus
          • Cloud Infrastructure: vSphere (Private, Public)
  • The New Big Data System
      • [Diagram labels: Business Intelligence; Real Time Streams; Automated Models; Real-Time Stream Processing; Data Visualization (Excel, Tableau); ETL; Common Query; Real Time Structured Database; Structured Data; Unstructured and Batch Processing Engine (Hadoop, Hive); Federated Query (SQL aggregation); Structured and Unstructured Data (HDFS, S3); Cloud Infrastructure (Compute, Storage, Networking)]
  • An Example – Automated Performance Management
      • [Diagram: 10M performance stats/min flow into Trigger Models; a Batch Baseline Calculation runs over the Stats Database; all on Cloud Infrastructure (Compute, Storage, Networking)]
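The slide names a batch baseline calculation and trigger models but does not say how either works. One possible reading, shown here purely as an assumption, is a batch job that summarizes historical samples and a per-sample check against that summary. The function names and the three-standard-deviation threshold are hypothetical.

```python
from statistics import mean, stdev


def compute_baseline(history):
    """Batch step: summarize historical samples as (mean, standard deviation)."""
    return mean(history), stdev(history)


def should_trigger(sample, baseline, threshold=3.0):
    """Streaming step: flag a sample more than `threshold` std devs from the mean.

    The z-score rule is an illustrative assumption, not the model from the talk.
    """
    mu, sigma = baseline
    if sigma == 0:
        return sample != mu
    return abs(sample - mu) / sigma > threshold


if __name__ == "__main__":
    latencies_ms = [10, 12, 11, 9, 10, 11, 10, 12]   # made-up history
    baseline = compute_baseline(latencies_ms)
    for sample in (11, 50):
        print(sample, should_trigger(sample, baseline))
```

Splitting the work this way keeps the expensive pass over the stats database offline, so the per-sample check stays cheap enough for a high-rate stream.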
  • Big (Data) problems: becoming the standardized stack
      • Companies compared (columns): Google, Facebook, Yahoo, LinkedIn, Cloudera, Twitter. Entries below are in slide order; "-" marks a blank cell.
      • Metadata: Dremel, Hive, Hive, Hive
      • Schedule & pipeline workloads: Evenflow, Databee, Oozie, Azkaban, Oozie
      • Dataflow/queries: -/Sawzall, /Hive, Pig/Hive, Pig, Pig/Hive, Cascading
      • More-structured data store: Bigtable, HBase, HBase, Voldemort, HBase, Cassandra
      • DB data collection/integration: MySQL gateway, Sqoop, Sqoop
      • Event data collection: Scribe, Data Highway, Kafka?, Flume, Scribe
      • Streaming data processing: -, -, -, -, -, -
      • Batch data processing: Map/Reduce, Hadoop, Hadoop, Hadoop, Hadoop, Hadoop
      • File storage: GFS, Hadoop, Hadoop, Hadoop, Hadoop, Hadoop
      • Coordination: Chubby, Zookeeper, Zookeeper, Zookeeper, Zookeeper, Zookeeper
  • New Technologies
      • [Diagram: the New Big Data System annotated with example technologies]
          • Data sources: Twitter, Sensor Data, Mobile Events, Machine Logs
          • Machine Learning / Automated Models: CETAS
          • Real Time Streams / Real-Time Stream Processing: S4, Storm, …
          • Business Intelligence / Data Visualization: Excel, Tableau
          • Common Query: SPARK, SHARK
          • Structured data: Aster, Greenplum
          • Real Time Structured Database: Gemfire, HBase?, etc.
          • Unstructured and Batch Processing (Hadoop, Hive): Map-Reduce
          • Query Virtualization (SQL aggregation)
          • Storage: HDFS, Ceph, MAPR, Colossus
          • Cloud Infrastructure: Compute, Storage, Networking
  • Agenda
      • Frameworks
          • Batch processing: Hadoop, Spark
          • Graph processing: Pregel, Apache Giraph
          • Real-time processing: Storm, S4, D-Streams
          • Interactive processing: Hive, Impala, Shark
      • New requirements
          • Better network architectures, abstractions and end-to-end resource management
          • Whither disk-locality and the flexibility to move data to compute instead
          • Cluster/Datacenter-wide storage abstractions and services
          • The silo-less datacenter (multiple frameworks sharing a single physical cluster and sharing sticky data)
  • Big Data Processing Patterns (batch, real-time or interactive)
      • Hadoop, Hive, Impala, Storm, S4, D-Streams, Shark
          • Funnel (large input, small output, e.g., link/ad click-statistics)
          • Reverse Funnel (small input, large output, e.g., logfile loading)
          • Data transform (input and output sizes similar, e.g., data conversion/translation)
      • Spark: iterative, e.g., machine learning tasks
      • Pregel, Giraph: graph-based analyses to reason about relationships, e.g., PageRank, Ravi's social approach to VI management
  • Batch processing frameworks (1/2)
      • Apache Hadoop MapReduce (Yahoo!)
          • Parallel data-processing paradigm (made popular by Google). Uses a distributed file system (HDFS) for persistence. Uses commodity h/w
          • Model of operation: Mapper (read from HDFS + compute in parallel) -> Reducer (process map outputs in parallel) -> write to HDFS
          • Key components: Namenode, Datanode, TaskTracker, JobTracker
          • Apache Zookeeper sometimes used for coordination
          • Weakness: not well-suited for iterative (or graph) computations
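The mapper -> reducer model of operation can be illustrated with a word count. This is a single-process sketch in plain Python, not Hadoop's API: there is no HDFS, no parallelism and no fault tolerance, and the names `mapper`, `shuffle`, `reducer` and `run_job` are chosen here for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine all values seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over in-memory input."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(["big data", "big iron"]))
```

Because each mapper call sees one record and each reducer call sees one key, both phases can be spread across machines, which is the property the real framework exploits.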
  • Batch processing frameworks (2/2)
      • Spark (UC Berkeley)
          • Support for iterative computations and interactive data-mining by caching data in cluster RAM. Uses commodity machines
          • Core abstraction: Resilient Distributed Datasets (RDDs) used as variables in Spark programs. RDDs include lineage data for easy recovery/reconstruction
          • Up to ~20X speedup over Hadoop. Used by Quantifind, Conviva, …
      • Image courtesy Zaharia et al.: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
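The idea that a dataset carries its own lineage, so that lost results can be rebuilt rather than replicated, can be shown with a toy class. `ToyRDD` below is a hypothetical stand-in written for this transcript and is not Spark's API; it runs in one process and always caches its result, whereas Spark caches only when asked.

```python
class ToyRDD:
    """Toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source      # base data, set only on a root dataset
        self._parent = parent      # dataset this one was derived from
        self._fn = fn              # transformation applied to the parent's items
        self._cache = None         # materialized result, if any
        self.compute_count = 0     # how many times this dataset was computed

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda items: [f(x) for x in items])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda items: [x for x in items if pred(x)])

    def collect(self):
        """Return the data, recomputing from lineage when nothing is cached."""
        if self._cache is not None:
            return list(self._cache)
        self.compute_count += 1
        if self._parent is None:
            data = list(self._source)
        else:
            data = self._fn(self._parent.collect())
        self._cache = data
        return list(data)

    def evict(self):
        """Simulate losing the cached result, e.g. after a node failure."""
        self._cache = None


if __name__ == "__main__":
    squares = ToyRDD(source=range(10)).map(lambda x: x * x)
    even_squares = squares.filter(lambda x: x % 2 == 0)
    print(even_squares.collect())   # computed once, then served from cache
    even_squares.evict()
    print(even_squares.collect())   # rebuilt from lineage
```

Repeated calls to `collect` hit the in-memory copy, which mirrors why iterative jobs benefit from keeping data in cluster RAM.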
  • Graph processing frameworks
      • Pregel (Google)/Apache Giraph
          • Multiple instances of vertex-programs: user-defined functions running at/on each vertex
          • Bulk Synchronous Parallel (BSP) processing, e.g., used for PageRank
          • Stateful in-memory computations. Fault-tolerance via checkpoints
          • Runs on commodity hardware (racks with high intra-rack bandwidth)
      • [Diagram: supersteps of Compute, Communicate, Barrier across VM1 and VM2]
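The compute, communicate, barrier cycle can be sketched with PageRank in plain Python. This is a single-process illustration and not the Pregel or Giraph API. It assumes every vertex has at least one outgoing edge and that every neighbour is itself a key of `graph`; the function name and the default of 20 supersteps are arbitrary choices.

```python
def pagerank_bsp(graph, supersteps=20, damping=0.85):
    """PageRank as BSP supersteps; `graph` maps each vertex to its out-neighbours."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # Compute + communicate: each vertex sends rank / out_degree to its neighbours.
        inbox = {v: [] for v in graph}
        for vertex, neighbours in graph.items():
            if neighbours:
                share = rank[vertex] / len(neighbours)
                for target in neighbours:
                    inbox[target].append(share)
        # Barrier: every message is delivered before any vertex updates its state.
        rank = {v: (1 - damping) / n + damping * sum(inbox[v]) for v in graph}
    return rank


if __name__ == "__main__":
    cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(pagerank_bsp(cycle))
```

Building the whole inbox before updating any rank is what makes each iteration a superstep: no vertex sees a value from the current round.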
  • Real-time processing frameworks (stream-processing) 1/2
      • S4 (Yahoo!), Storm (Twitter)
          • Record-at-a-time processing. Checkpointing for fault-tolerance (S4)
      • Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
  • Real-time processing frameworks (stream-processing) 2/2
      • Discretized Streams/D-Streams (UC Berkeley)
          • Treat a streaming computation as a series of batch computations on small time intervals. D-Stream = chain of RDDs
          • Fault-tolerance without replication or upstream backup (buffering)
      • Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
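Treating a stream as a series of small batches can be sketched as follows. This is a plain-Python illustration rather than the D-Streams implementation: the function names are hypothetical, events are assumed to be `(timestamp, value)` pairs with non-negative timestamps, and intervals that contain no events are simply skipped.

```python
from collections import Counter, defaultdict


def discretize(events, interval):
    """Group (timestamp, value) events into batches, one per time interval."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    # Return batches in time order; empty intervals do not appear.
    return [batches[index] for index in sorted(batches)]


def count_per_batch(events, interval=1.0):
    """Run the same small batch computation (a count) on every interval."""
    return [Counter(batch) for batch in discretize(events, interval)]


if __name__ == "__main__":
    clicks = [(0.1, "a"), (0.7, "b"), (1.2, "a"), (1.9, "a"), (3.0, "c")]
    for batch_counts in count_per_batch(clicks, interval=1.0):
        print(dict(batch_counts))
```

Since each interval is an ordinary deterministic batch job, a lost result can be recomputed from its input instead of being replicated, which is the fault-tolerance property the slide refers to.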
  • Interactive processing frameworks 1/4
      • Apache Hive (Facebook)
          • Open-source data warehouse built on top of Hadoop. HiveQL queries compiled into MapReduce jobs. Expensive WHERE clauses = table scans = high latency
      • Image courtesy Cubrid: http://www.cubrid.org/blog/dev-platform/platforms-for-big-data/
  • Interactive processing frameworks 2/4
      • Pivotal HAWQ
  • Interactive processing frameworks 3/4
      • Impala (Cloudera)
          • Inspired by Dremel (Google). Key concepts: columnar-data storage (Trevni), aggregation trees for distributed query evaluation
          • Takes advantage of Hive tables. Uses memory as a cache for tables
          • Does not use MapReduce to answer queries (unlike Hive)
          • 3X - 90X faster than Hive
      • Image courtesy Cloudera: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
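The two key concepts named here, columnar storage and aggregation trees, can be illustrated together. This is a conceptual sketch in plain Python and does not reflect Impala's or Trevni's actual implementation. It assumes non-empty partitions of row dicts and a `fan_in` of at least 2; all names are hypothetical.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists (columnar layout)."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def partial_sum(columns, column):
    """Leaf node: scan only the requested column of the local partition."""
    values = columns[column]
    return sum(values), len(values)


def merge(partials):
    """Inner node: combine (sum, count) pairs received from child nodes."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count


def average(partitions, column, fan_in=2):
    """Evaluate AVG(column) by merging partial results up a tree."""
    level = [partial_sum(to_columns(part), column) for part in partitions]
    while len(level) > 1:
        level = [merge(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    total, count = level[0]
    return total / count


if __name__ == "__main__":
    partitions = [
        [{"user": "a", "ms": 10}, {"user": "b", "ms": 20}],
        [{"user": "c", "ms": 30}],
        [{"user": "d", "ms": 40}],
    ]
    print(average(partitions, "ms"))
```

The columnar layout lets a leaf read one column instead of whole rows, and carrying `(sum, count)` pairs up the tree keeps the average exact even when partitions differ in size.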
  • Interactive processing frameworks 4/4
      • Shark (UC Berkeley)
          • Key concepts: columnar-data storage (in-memory), Directed Acyclic Graphs of Tasks for distributed query optimization and evaluation, dynamic mid-query replanning
          • Uses Spark RDDs to store data and query processing results
          • SQL-interface (HiveQL compatible)
          • 100X faster than Hadoop, 100X faster than Hive
      • Image courtesy Xin et al.: http://shark.cs.berkeley.edu/presentations/2012-11-26-shark-tech-report.pdf
  • Unifying the Big Data Platform using Virtualization
      • Goals
          • Make it fast and easy to provision new data clusters on demand
          • Allow mixing of workloads
          • Leverage virtual machines to provide isolation (esp. for multi-tenant)
          • Optimize data performance based on virtual topologies
          • Make the system reliable based on virtual topologies
      • Leveraging Virtualization
          • Elastic scale
          • Use high-availability to protect key services, e.g., Hadoop’s namenode/job tracker
          • Resource controls and sharing: re-use underutilized memory, CPU
          • Prioritize workloads: limit or guarantee resource usage in a mixed environment
      • [Diagram: Cloud Infrastructure, Private and Public]
  • A Unified Analytics Cloud Significantly Simplifies
      • Simplify
          • Single Hardware Infrastructure
          • Faster/Easier provisioning
      • Optimize
          • Shared Resources = higher utilization
          • Elastic resources = faster on-demand access
      • [Diagram: separate SQL, NoSQL, Hadoop and Decision Support clusters consolidated into one Unified Analytics Infrastructure (Big SQL, NoSQL, Hadoop) across Private and Public clouds]
  • Simplify Heterogeneous Data Management via Data PaaS
      • [Diagram layers: Analytics Tools; Developer; Databases (Large-Scale NoSQL, File-system, In-Memory, Big SQL); Data Platform with Data PaaS – Common Data Management Layer (Provisioning, Multi-tenancy, Import/Export, Management, Data Discovery); Cloud Infrastructure]
  • Technology: Databases and Data Stores for Big Data (from Unstructured to Structured)
      • File-system
          • Types of data: log files, machine generated data, documents, statistics, device data, etc.
          • Technologies: NAS, HDFS, Blob, S3, MAPR, etc.
          • Values: store any data, easy to scale-out, can optimize for cost
      • Large-Scale NoSQL
          • Types of data: loosely typed device data, records, events, complex relations/graphs
          • Technologies: Cassandra, HBase, Voldemort
          • Values: easy to scale-out, flexible and dynamic schemas
      • In-Memory
          • Types of data: structured, partitionable data
          • Technologies: Gemfire, Redis, Membase, SPARK
          • Values: high throughput, low latency
      • Big SQL
          • Types of data: structured data
          • Technologies: HawQ, Impala, Aster, …
          • Values: high performance for repetitive queries, ease of query language
  • The Unified Analytics Cloud Platform
      • [Diagram layers, top to bottom]
          • Analytics Tools: Madlib, Karmasphere, Datameer, Tableau, Hadoop, R
          • Developer Frameworks / PaaS: Spring, Python, Cloudfoundry
          • Database/DataStore: Cassandra, HBase, HDFS, Greenplum, Voldemort
          • Data Platform / Data PaaS: Data-Director, EMC Chorus
          • Cloud Infrastructure: vSphere (Private, Public)
  • Summary
      • Revolution in Big Data is under way
          • Data centric applications are now critical
      • Hadoop on Virtualization
          • Proven performance
          • Cloud/Virtualization values apparent for Hadoop use
      • Simplify through a Unified Analytics Cloud
          • One platform for today’s and future big-data systems
          • Better utilization
          • Faster deployment, elastic resources
          • Secure, isolated, multi-tenant capability for analytics
  • References
      • Twitter
          • @richardmcdougll
      • My CTO Blog
          • http://communities.vmware.com/community/vmtn/cto/cloud
      • Hadoop on vSphere
          • Talk @ Hadoop World
          • Performance Paper – http://www.vmware.com/files/.../VMW-Hadoop-Performance-vSphere5.pdf
      • Spring Hadoop
          • http://blog.springsource.org/2012/02/29/introducing-spring-hadoop