Research on big data
 


• Big Data projects overview at EMC Labs China
• Introduction to Cloud Databases
• Data analytics in the cloud
  – Parallel DBMS
  – MapReduce
• FlexDB - A cloud-scale database engine based on Hadoop

    Research on big data: Presentation Transcript

    • Research on Big Data - FlexDB: A cloud-scale database engine based on Hadoop. Jidong Chen (jidong.chen@emc.com), Manager, Research Scientist, Big Data Lab, EMC Labs China. Sept. 2011. © Copyright 2011 EMC Corporation. All rights reserved.
    • Grand Opening Announcement: EMC Labs China is formed from EMC Research China and the Advanced Technology Venture group, which were established in 2007 by the Office of the CTO.
    • EMC Labs China - Vision and Mission
      • Vision
        – Become an elite research and advanced technology institute in China
        – Become the model for future EMC Labs worldwide
      • Advanced Technology Research and Development
      • University Collaboration
      • Industry Standards Office
      • IP Portfolio Development
      • Labs: Big Data Lab; Cloud Infrastructure and System Lab; Cloud Platform and Applications Lab
    • Outline
      • Big Data projects overview at EMC Labs China
      • Introduction to Cloud Databases
      • Data analytics in the cloud
        – Parallel DBMS
        – MapReduce
      • FlexDB - A cloud-scale database engine based on Hadoop
      • Summary
    • The Digital Universe 2009-2020: Growing by a Factor of 44. 2009: 0.8 ZB; 2020: 35.2 ZB (zettabytes). Source: IDC Digital Universe Study, sponsored by EMC, May 2010.
    • Big Data is Changing the World
      Expanding Data Sources
      • Science and research
        – Gene sequences
        – LHC accelerator
        – Earth and space exploration
      • Enterprise applications
        – Email, documents, files
        – Application logs
        – Transaction records
      • Web 2.0 data
        – Search log / click stream
        – Twitter / Blog / SNS
        – Wiki
      • Other unstructured data
        – Video/Movie
        – Graphics
        – Digital widgets
      Bigger Challenges
      • Scale out automatically
        – Vs. scale up manually
      • More capacity and bigger pool
        – E.g., 10 PB in a single file system
      • New process capability
        – Loading, analyzing, moving data
        – Intelligence
      • Better performance
        – Linear vs. exponential
        – Faster
      • Autonomous
        – Less human intervention
        – Lower cost
    • Research Scopes and Topics in Big Data
      • Search and Analytics
        – Search: Entity Search, Faceted Search, Associative Search
        – Analytics: Text Analysis, Activity Modeling and Sequence Analysis, Real-time Data Analysis for Streaming, Parallel Data Mining Algorithms
      • MPP Databases and Data Services
        – Parallel Database: Parallel Query Optimization, Data Partitioning and Replication, Distributed Transactions
        – In-memory Database: Cache, Recovery, Consistency
        – Database as a Service: Multi-tenant Data Management, Auto-Administration
      • Hadoop/NoSQL
        – Hadoop: Single-node Failure, Performance, Real-time MapReduce Scheduler and Fault Tolerance
        – NoSQL: Key-Value Store, Document Store, Graph Data Store
    • Project Overview
      • Hadoop/NoSQL
        – vHadoop - joint project with VMware
          • Parallel SAN file system for DISC on virtualized platform
        – Online MapReduce for Real-time Data Analytics
          • Pipelined task execution, Group task scheduling, Enhanced fault tolerance
          • Parallel Data Mining
        – FlexDB: Cloud-scale Parallel Database for OLAP
          • MapReduce integration into DBMS, Parallel query execution, Cost-based query optimization
        – Cloud-scale Parallel Database for OLTP
          • Intelligent database sharding and resharding
          • Active-active (eager) replication with group communication service
          • Multiple masters with elastic distributed coordination
    • Cloud Databases
      • Two largest components of the data management market
        – Transactional Data Management
          • Banks, airline reservation, online e-commerce
          • ACID, write-intensive
        – Analytical Data Management
          • Business planning, decision support
          • Query-intensive
      • Challenges of data management in the Cloud
        – Scalability
        – Fault Tolerance
        – Availability & Consistency
        – Transaction Management
        – Flexible Schemas
    • Cloud Databases
      • Data analytics in the cloud
        – Parallel DBMS
        – MapReduce
      • Transactional data management in the cloud
        – NoSQL Store
        – SQL Database
      • Cloud data services (Database as a Service)
        – Multi-tenant data management
        – Auto-administration
    • Commercial Landscape: Major Players
      • Amazon EC2
        – IaaS abstraction
        – Data management using S3 and SimpleDB
      • Microsoft Azure
        – PaaS abstraction
        – Relational engine (SQL Azure)
      • Google AppEngine
        – PaaS abstraction
        – Data management using Google Megastore
    • Data Analytics in the Cloud
      • Scalability to large data volumes:
        – Scan 100 TB on 1 node @ 50 MB/sec = 23 days
        – Scan on 1000-node cluster = 33 minutes
        – Divide-and-conquer (i.e., data partitioning)
      • Cost-efficiency:
        – Commodity nodes (cheap, but unreliable)
        – Commodity network
        – Automatic fault-tolerance (fewer admins)
        – Easy to use (fewer programmers)
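The scan-time figures on this slide can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a perfectly even split of the data across nodes, and no coordination overhead; the function name `scan_seconds` is hypothetical.

```python
def scan_seconds(data_bytes: float, nodes: int, mb_per_sec: float = 50.0) -> float:
    """Seconds to scan data_bytes when it is split evenly across `nodes`."""
    bytes_per_sec_per_node = mb_per_sec * 1_000_000  # assumption: decimal megabytes
    return data_bytes / (nodes * bytes_per_sec_per_node)


HUNDRED_TB = 100 * 10**12  # assumption: decimal terabytes

single_node_days = scan_seconds(HUNDRED_TB, nodes=1) / 86_400
cluster_minutes = scan_seconds(HUNDRED_TB, nodes=1000) / 60

print(f"1 node:     {single_node_days:.1f} days")
print(f"1000 nodes: {cluster_minutes:.1f} minutes")
```

Under these assumptions the single-node scan takes about 23 days and the 1000-node scan about 33 minutes, which matches the slide.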
    • Solutions for Large-scale Data Analysis
      • Parallel DBMS technologies
        – Proposed in the late eighties
        – Matured over the last two decades
        – Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
      • MapReduce
        – Pioneered by Google
        – Popularized by Yahoo! (Hadoop)
    • Parallel DBMS technologies
      • Widely used for more than two decades
        – Research projects: Gamma, Grace, …
        – Commercial: Teradata, Greenplum (acquired by EMC), Netezza (acquired by IBM), DATAllegro (acquired by Microsoft), Vertica (acquired by HP), Aster Data (acquired by Teradata)
      • Shared-nothing node clusters
      • Relational data model
      • Indexing
      • Familiar SQL interface
      • Parallel query execution
        – Horizontal partitioning of relational tables with partitioned execution of SQL queries
      • Advanced query optimization
      • Well understood and studied
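Horizontal partitioning with partitioned query execution can be sketched in plain Python. The sketch below is illustrative only and does not reflect any particular product: rows are hash-partitioned on a distribution key, each "node" computes a partial aggregate over its own rows, and the partial results are merged. The table contents, `NUM_NODES`, and all function names are assumptions made for this example.

```python
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def partition(rows, key, num_nodes=NUM_NODES):
    """Horizontally partition rows by hashing the distribution key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes


def local_sum(rows, group_key, value_key):
    """Partial aggregate computed by one node over its own partition."""
    partial = defaultdict(float)
    for row in rows:
        partial[row[group_key]] += row[value_key]
    return partial


def parallel_sum(nodes, group_key, value_key):
    """Merge the per-node partial sums into the final answer."""
    total = defaultdict(float)
    for node_rows in nodes:  # each iteration stands in for one independent node
        for group, subtotal in local_sum(node_rows, group_key, value_key).items():
            total[group] += subtotal
    return dict(total)


# Roughly: SELECT o_custkey, SUM(o_totalprice) FROM orders GROUP BY o_custkey
orders = [
    {"o_orderkey": 1, "o_custkey": 10, "o_totalprice": 100.0},
    {"o_orderkey": 2, "o_custkey": 20, "o_totalprice": 75.0},
    {"o_orderkey": 3, "o_custkey": 10, "o_totalprice": 250.0},
]
totals = parallel_sum(partition(orders, "o_orderkey"), "o_custkey", "o_totalprice")
print(totals)
```

Because each node only touches its own partition, the per-node work shrinks as nodes are added; only the small partial results need to cross the network.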
    • Greenplum: A Shared-nothing Parallel DBMS
      • Greenplum's MPP Database has extreme scalability
        – Optimized for BI and analytics
        – Fault-tolerant reliability and optimized performance using commodity CPUs, disks and networking
      • Provides automatic parallelization
        – No need for manual partitioning or tuning
        – Just load and query like any database
        – Tables are automatically distributed across nodes
      • Extremely scalable and I/O optimized
        – All nodes can scan and process in parallel
        – No I/O contention between segments
      • Linear scalability by adding nodes
        – Each adds storage, query performance and loading performance
    • Greenplum Database Architecture: MPP (Massively Parallel Processing) Shared-Nothing Architecture
      • Interfaces: SQL, MapReduce
      • Master Servers: query planning & dispatch
      • Network Interconnect
      • Segment Servers: query processing & data storage
      • External Sources: loading, streaming, etc.
    • Example of Parallel Query Optimization
      Query:
        select c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue,
               c_acctbal, n_name, c_address, c_phone, c_comment
        from customer, orders, lineitem, nation
        where c_custkey = o_custkey
          and l_orderkey = o_orderkey
          and o_orderdate >= date '1994-08-01'
          and o_orderdate < date '1994-08-01' + interval '3 month'
          and l_returnflag = 'R'
          and c_nationkey = n_nationkey
        group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
        order by revenue desc
      Query plan (figure), from the top: Gather Motion 4:1 (slice 3), Sort, HashAggregate, HashJoin. Below it: Redistribute Motion 4:4 (slice 1), Broadcast Motion 4:4 (slice 2), further HashJoin and Hash nodes, and Seq Scans on lineitem, customer, orders and nation.
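The plan on this slide names three data-movement operators: Redistribute Motion, Broadcast Motion and Gather Motion. The sketch below shows the general idea behind each in plain Python; it is not Greenplum's executor. Which tables the real plan redistributes or broadcasts cannot be read from the transcript, so the choice here (redistribute `customer` and `orders`, broadcast `nation`) is an assumption, as are the sample rows and function names.

```python
NUM_SEGMENTS = 4  # matches the "4:4" and "4:1" in the plan


def redistribute(segments, key):
    """Re-hash every row on `key` so rows with equal keys meet on one segment."""
    out = [[] for _ in range(NUM_SEGMENTS)]
    for seg in segments:
        for row in seg:
            out[hash(row[key]) % NUM_SEGMENTS].append(row)
    return out


def broadcast(segments):
    """Send a full copy of a (small) table to every segment."""
    all_rows = [row for seg in segments for row in seg]
    return [list(all_rows) for _ in range(NUM_SEGMENTS)]


def gather(segments):
    """Collect all segment outputs on a single node (N:1)."""
    return [row for seg in segments for row in seg]


def local_hash_join(left, right, left_key, right_key):
    """Hash join executed independently inside one segment."""
    table = {}
    for row in right:
        table.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in table.get(l[left_key], [])]


customer = [[{"c_custkey": 1, "c_nationkey": 7}], [{"c_custkey": 2, "c_nationkey": 8}], [], []]
orders = [[{"o_orderkey": 100, "o_custkey": 2}], [],
          [{"o_orderkey": 101, "o_custkey": 1}], [{"o_orderkey": 102, "o_custkey": 2}]]
nation = [[{"n_nationkey": 7, "n_name": "CHINA"}], [{"n_nationkey": 8, "n_name": "FRANCE"}], [], []]

cust_r = redistribute(customer, "c_custkey")
ord_r = redistribute(orders, "o_custkey")
joined = [local_hash_join(o, c, "o_custkey", "c_custkey") for o, c in zip(ord_r, cust_r)]
joined = [local_hash_join(j, n, "c_nationkey", "n_nationkey")
          for j, n in zip(joined, broadcast(nation))]
result = gather(joined)
```

Redistribution moves each row once and suits large inputs; broadcasting copies the whole table to every segment, so it is only attractive when that table is small. A cost-based optimizer weighs the two options per join.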
    • MapReduce
      • Overview
        – Large-scale, massively parallel data access platform
        – Simple data-parallel programming model to express relatively sophisticated distributed programs
        – An associated parallel and distributed implementation for commodity clusters
      • Pioneered by Google
        – Processes 20 PB of data per day
      • Popularized by the open-source Hadoop project
        – Used by Yahoo!, Facebook, Amazon, and the list is growing …
    • Programming Framework: raw input <K1, V1> → MAP → <K2, V2> → REDUCE → <K3, V3>
    • MapReduce Example: WordCount
        Map(K, V) {
          For each word w in V
            Collect(w, 1);
        }
        Combine(K, V[ ]) {
          Int count = 0;
          For each v in V
            count += v;
          Collect(K, count);
        }
        Reduce(K, V[ ]) {
          Int count = 0;
          For each v in V
            count += v;
          Collect(K, count);
        }
      Figure: input splits (size: TByte) each feed a map task followed by a combine step; reduce tasks write the outputs part0, part1, part2 (e.g., Cat 3, Bat 4, Dog 3, other words …).
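The WordCount pseudocode above can be turned into a small single-process Python sketch. It mimics the map, combine and reduce steps with ordinary function calls, so it illustrates the data flow only, not Hadoop's API or its distributed execution. The helper `group_by_key`, the driver `word_count` and the sample splits are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(_key, text):
    """Map: emit (word, 1) for every word in the input value."""
    for word in text.split():
        yield word, 1


def combine_fn(word, counts):
    """Combine: pre-aggregate counts on the mapper side."""
    yield word, sum(counts)


def reduce_fn(word, counts):
    """Reduce: add up all partial counts for one word."""
    yield word, sum(counts)


def group_by_key(pairs):
    """Stand-in for the shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def word_count(splits):
    combined = []
    for split_id, text in enumerate(splits):  # one map task per split
        local = group_by_key(map_fn(split_id, text))
        for word, counts in local.items():
            combined.extend(combine_fn(word, counts))
    result = {}
    for word, counts in group_by_key(combined).items():
        for key, total in reduce_fn(word, counts):
            result[key] = total
    return result


splits = ["Cat Bat Dog Cat", "Bat Dog Bat", "Cat Dog Bat"]
counts = word_count(splits)
print(counts)
```

The combiner is optional for correctness here because addition is associative and commutative; its purpose is to shrink the intermediate data that each mapper hands to the reducers.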
    • MapReduce Implementation in Hadoop (figure): the client submits a job to the master, which assigns map and reduce tasks. Mappers read the input splits (split0 to split4) and write intermediate files to local disk; reducers read those files remotely and write the output files (file0, file1). Phases: input files → map phase → intermediate files (local disk) → reduce phase → output files.
    • MapReduce Advantages
      • Automatic parallelization:
        – Depending on the size of the raw input data → instantiate multiple MAP tasks
        – Similarly, depending on the number of intermediate <key, value> partitions → instantiate multiple REDUCE tasks
      • Run-time:
        – Data partitioning
        – Task scheduling
        – Handling machine failures
        – Managing inter-machine communication
      • Completely transparent to the programmer/analyst/user
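How the runtime might derive task counts from the data can be illustrated with a short sketch. This is not Hadoop's actual logic: the 64 MiB split size, the CRC32-based partitioner and both function names are assumptions chosen for illustration.

```python
import math
import zlib

SPLIT_BYTES = 64 * 1024 * 1024  # assumption: one map task per 64 MiB of input


def num_map_tasks(input_bytes: int, split_bytes: int = SPLIT_BYTES) -> int:
    """One map task per input split, with at least one task."""
    return max(1, math.ceil(input_bytes / split_bytes))


def reducer_for(key: str, num_reduce_tasks: int) -> int:
    """Assign an intermediate key to a reduce partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


one_tib = 1024**4
print(num_map_tasks(one_tib))      # map tasks for 1 TiB of raw input
print(reducer_for("Cat", 3))       # partition index in the range 0..2
```

A stable hash matters because every mapper must send a given key to the same reducer; Python's built-in `hash` for strings varies between processes, which is why `zlib.crc32` is used here.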
    • Possible Applications
      • Special-purpose programs to process large amounts of data: crawled documents, Web query logs, etc.
        – ETL and "read once" data sets
        – Complex analytics
        – Semi-structured data, key-value pairs
      • At Google and others (Yahoo!, Facebook):
        – Inverted index
        – Graph structure of Web documents
        – Summaries of #pages/host, set of frequent queries, etc.
        – Ad optimization
        – Spam filtering
    • MapReduce vs Parallel DBMS
      • Schema support: Parallel DBMS: ✓; MapReduce: not out of the box
      • Indexing: Parallel DBMS: ✓; MapReduce: not out of the box
      • Programming model: Parallel DBMS: declarative (SQL); MapReduce: imperative (C/C++, Java, …), extensions through Pig and Hive
      • Optimizations (compression, query optimization): Parallel DBMS: ✓; MapReduce: not out of the box
      • Flexibility: Parallel DBMS: not out of the box; MapReduce: ✓
      • Fault tolerance: Parallel DBMS: coarse-grained techniques; MapReduce: ✓
    • Further Analysis and Comparison
      • Limitations of some current parallel databases / data warehouses
        – Often use expensive/specialized hardware
        – Difficult to scale to more than 100 nodes
        – Difficult to parallelize data mining applications
          • MPI …
        – Difficult to deal with unstructured data
        – Fault tolerance
          • If one node fails, the whole query restarts
        – Expensive
      • Disadvantages of some MapReduce-based solutions (Hive)
        – A sub-optimal brute-force implementation: no indexing, no JOINs
          • E.g., find the people whose salary is $10,000
        – Row-based storage; updates?
        – Not SQL/BI tool compatible
        – No support for schema
        – Non-declarative programming model
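The salary example on this slide contrasts a brute-force scan with an indexed lookup. A minimal sketch of the difference follows; the employee rows and function names are made up for illustration and say nothing about how Hive or any DBMS implements either path.

```python
from collections import defaultdict

# Hypothetical (name, salary) rows
employees = [("ann", 10_000), ("bob", 8_500), ("cho", 10_000), ("dev", 12_000)]


def full_scan(rows, salary):
    """Brute force: examine every row, as a scan-only engine must."""
    return [name for name, s in rows if s == salary]


def build_index(rows):
    """Build a salary -> names index once, ahead of query time."""
    index = defaultdict(list)
    for name, s in rows:
        index[s].append(name)
    return index


def index_lookup(index, salary):
    """Answer the same question by touching only the matching entries."""
    return index.get(salary, [])


salary_index = build_index(employees)
print(full_scan(employees, 10_000))
print(index_lookup(salary_index, 10_000))
```

Both calls return the same names, but the scan cost grows with the table size while the lookup cost grows with the number of matches, which is the gap the slide attributes to missing indexes.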
    • MapReduce Integration in DBMS Context
      • FlexDB - A Cloud-scale Parallel Database Engine based on Hadoop MapReduce (a research project)
        – An architectural hybrid of MapReduce and DBMS technologies
        – Uses the fault tolerance and scalability of the MapReduce framework
        – Leverages advanced data processing techniques (e.g., query optimization) of an RDBMS for high performance
        – Exposes a declarative interface to the user
      • Goal: leverage the best of both worlds
    • FlexDB Architecture (figure)
    • FlexDB query execution (figure)
      • FlexDB Master: Query Parser, Query Optimizer, Job Generator, Catalog Manager, Job Executor
      • Example query: SELECT * FROM Account WHERE balance > 30
      • The Job Executor submits jobs to the MapReduce framework (Mapper, Reducer)
      • Each mapper runs the subquery SELECT * FROM Account WHERE balance > 30 against its local database
      • The rows of the Account table (r0, r1, r2, … with columns n and m) are partitioned across the node databases, e.g., r0 and r1 on one database, r2 and r3 on the next, and so on
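The pattern in this figure, where the same subquery is pushed to each node's local database and the partial results are merged, can be sketched with Python's standard `sqlite3` module. This is only an illustration of the idea and not FlexDB's implementation: SQLite stands in for the Postgres/Greenplum node databases, the `mapper` and `reducer` functions run in one process rather than under Hadoop, and the table columns and rows are invented.

```python
import sqlite3

SUBQUERY = "SELECT * FROM Account WHERE balance > 30"


def make_local_db(rows):
    """Create one node-local database holding a partition of Account."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Account (id TEXT, balance INTEGER)")
    conn.executemany("INSERT INTO Account VALUES (?, ?)", rows)
    return conn


def mapper(conn, subquery):
    """Map task: run the pushed-down subquery on the local database."""
    return conn.execute(subquery).fetchall()


def reducer(partial_results):
    """Reduce task: merge the per-node result sets."""
    merged = [row for part in partial_results for row in part]
    return sorted(merged)


# Hypothetical partitioning of Account across three node databases
partitions = [
    [("r0", 10), ("r1", 45)],
    [("r2", 31), ("r3", 30)],
    [("r4", 99), ("r5", 5)],
]
databases = [make_local_db(p) for p in partitions]
result = reducer(mapper(db, SUBQUERY) for db in databases)
print(result)
```

Pushing the filter into each local database means only qualifying rows leave a node, and each database can use its own indexes and optimizer for its share of the work.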
    • Comparison with other systems (FlexDB / Hive / HadoopDB / Traditional parallel database)
      • Query language: FlexDB: SQL; Hive: HQL; HadoopDB: SQL (does not support join currently); Traditional: SQL
      • Storage: FlexDB: Postgres/Greenplum; Hive: HDFS; HadoopDB: JDBC compatible; Traditional: native OS files
      • Optimizer: FlexDB: cost based (DB/MR paths); Hive: simple rule based; HadoopDB: simple rule based; Traditional: cost based
      • Physical storage organization: FlexDB: column/row based; Hive: row based; HadoopDB: currently row based; Traditional: column/row based
      • Implementation: FlexDB: FlexDB Master + Hadoop + DB; Hive: Hive + Hadoop; HadoopDB: Hive (rev) + Hadoop + DB; Traditional: native
      • Efficiency: FlexDB: high; Hive: low; HadoopDB: middle; Traditional: very high
      • Scale: FlexDB: large; Hive: large; HadoopDB: large; Traditional: middle
      • Cost: FlexDB: low; Hive: low; HadoopDB: low; Traditional: high
    • Summary
      • New in cloud computing
        – Elasticity/Scalability
        – Resource sharing (multi-tenancy)
        – Focus on failure
      • Data analytics in the cloud: different solutions suit different workloads
        – Parallel DBMSs excel at efficient querying of large data sets
        – MR-style systems excel at complex analytics and ETL tasks
      • Combine MapReduce with a shared-nothing DBMS to produce a system that better fits the cloud computing market
    • Acknowledgements
      • Some slides are adapted from the following references:
        – Divy Agrawal, Sudipto Das, and Amr El Abbadi, "Big Data and Cloud Computing: New Wine or just New Bottles?", VLDB 2010 Tutorial
        – Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin, "MapReduce and Parallel DBMSs: Friends or Foes?", Communications of the ACM, 2010
    • EMC Labs China. Dr. Tao Bo (陶波), Head of EMC Labs China. Blog: http://blog.sina.com.cn/emclabschina Weibo: http://weibo.com/emclabschina
    • THANK YOU