The Elephant In The Room: Big Data Analytics In the Cloud
 

    Presentation Transcript

    • The Elephant In The Room: Big Data Analytics In the Cloud
      Bill Peer, Principal, Infosys Labs
      UP 2012 Cloud Computing Conference, San Francisco, California, December 12, 2012
    • What’s on the agenda?
      • Definitions
      • Big Data and Analytic Technologies
      • Architecture Stuff
      • Summary
      • Appendix: References
    • Definitions - Big Data
      • Big Data: data processing scenarios wherein the volume, variety, and/or velocity of the data is such that conventional RDBMS and/or Data Warehouse technologies alone do not suffice for the need.
      • Bill’s Stake In The Ground:
        • Volume - Greater than 100 GB
        • Variety - Structured and Unstructured (forms, video, blogs, photos, …)
        • Velocity - 10 GB per hour
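The stake-in-the-ground thresholds above can be written as a simple predicate. This is only a minimal sketch: the function name and parameters are hypothetical, and it assumes that "and/or" means any one dimension is enough and that the velocity threshold is inclusive.

```python
def looks_like_big_data(volume_gb, gb_per_hour, has_structured, has_unstructured):
    """Return True if any of the three thresholds from the slide is met.

    Assumptions (not stated on the slide): one dimension suffices,
    and a velocity of exactly 10 GB per hour counts.
    """
    volume = volume_gb > 100                                # Volume: greater than 100 GB
    variety = bool(has_structured and has_unstructured)     # Variety: both kinds present
    velocity = gb_per_hour >= 10                            # Velocity: 10 GB per hour
    return volume or variety or velocity


if __name__ == "__main__":
    # A 500 GB purely structured data set arriving slowly still qualifies on volume.
    print(looks_like_big_data(500, 1, True, False))
```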
    • Definitions - Analytics
      • Analytics: discovery of meaningful patterns in data
      • Bill’s Two Uses:
        • Decision Support (to help make a choice)
          - Business Intelligence
          - Operational Intelligence
        • Value Creation (to add worth)
          - Algorithm Discovery
          - Analytics as a Service
    • Past, Present, Future (images sourced from Wikimedia Commons)
    • Big Data and Analytic Technologies
    • Big Data and Analytic Technology: 3 to Know - Hadoop
      • Based on Google Paper published in 2004 (MapReduce)
      • Can be segmented into 2 key capabilities: MapReduce and HDFS
      • Designed to work in a distributed, fault-prone environment
      • MapReduce - Processing Orchestration Framework (great if a problem can be easily divided)
      • HDFS - Hadoop Distributed File System (reliable, independent persistence mechanism by way of multi-node replication)
      • Job Based!
        • Pig Latin - Language to explore data
        • Hive QL - SQL-like calls
        • Mahout - Machine Learning collection
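To illustrate why MapReduce suits problems that "can be easily divided", here is a single-process sketch of the map, shuffle, and reduce steps applied to a word count. It is not Hadoop code and uses no Hadoop API; the function names are hypothetical, and in a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk is mapped independently, which is what makes the
    # problem easy to divide across machines.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data in the cloud", "the elephant in the room"]))
```

Because the map step looks at one chunk at a time and the reduce step looks at one key at a time, both can be spread across nodes without coordination beyond the shuffle.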
    • Big Data and Analytic Technology: 3 to Know - Drill
      • Based on Google Paper published in 2010 (Dremel)
      • Provides analysis of large-scale datasets
      • Designed to work in a distributed environment
      • Query Languages - Google BigQuery
      • Low-Latency Distributed Execution - Columnar-centric storage
      • Apache Incubator Phase
      • “[Dremel] is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.” (src: Google Dremel Paper)
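The benefit of columnar-centric storage for aggregation queries can be shown with a toy example. This sketch is not how Dremel or Drill is implemented (it ignores nested data, compression, and distribution); the helper names and sample records are made up for illustration, and every record is assumed to have the same fields.

```python
def to_columns(rows):
    """Pivot row-oriented records into one list per field (column)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_by(columns, key, value):
    """Aggregate one column grouped by another, reading only those two columns."""
    totals = {}
    for k, v in zip(columns[key], columns[value]):
        totals[k] = totals.get(k, 0) + v
    return totals


if __name__ == "__main__":
    rows = [
        {"user": "a", "country": "US", "bytes": 120},
        {"user": "b", "country": "DE", "bytes": 70},
        {"user": "c", "country": "US", "bytes": 30},
    ]
    columns = to_columns(rows)
    # Only the "country" and "bytes" columns are scanned; "user" is never touched.
    print(sum_by(columns, "country", "bytes"))
```

With a row layout the same query would have to read every field of every record, which is the cost a columnar layout avoids.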
    • Big Data and Analytic Technology: 3 to Know - Storm
      • Event Streaming platform used by Twitter
      • Allows for continuous real-time data spelunking
      • Designed to work in a distributed environment leveraging clusters
      • Resident Queries - Requests for event patterns of interest are continuously watched for
      • Topology Centric - You create graphs of computation
      • Event Streaming is Different - Storm can be used effectively to build a Complex Event Processing (CEP) capability by an enterprise. As with other CEP type frameworks, it requires a shift to an uncommon perspective to be effective.
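The ideas of a resident query and a graph of computation can be sketched with plain Python generators. This does not use the Storm API; the node names, event fields, and the "three failed logins" pattern are all hypothetical, chosen only to show a query that stays in place while events flow past it.

```python
from collections import defaultdict


def event_spout(events):
    """Source node: a real stream would be unbounded; here it is a finite list."""
    for event in events:
        yield event


def failed_login_bolt(stream):
    """Processing node: pass along only failed-login events."""
    for event in stream:
        if event["type"] == "login_failed":
            yield event


def repeated_failure_query(stream, threshold=3):
    """Resident query: emit an alert when a user reaches `threshold` failures."""
    failures = defaultdict(int)
    for event in stream:
        failures[event["user"]] += 1
        if failures[event["user"]] == threshold:
            yield {"user": event["user"], "failures": threshold}


def run_topology(events, threshold=3):
    # Wire the nodes into a small graph: spout -> bolt -> resident query.
    stream = failed_login_bolt(event_spout(events))
    return list(repeated_failure_query(stream, threshold))


if __name__ == "__main__":
    sample = [
        {"user": "alice", "type": "login_failed"},
        {"user": "bob", "type": "login_ok"},
        {"user": "alice", "type": "login_failed"},
        {"user": "bob", "type": "login_failed"},
        {"user": "alice", "type": "login_failed"},
    ]
    print(run_topology(sample))
```

The shift in perspective is that the query is registered once and the data comes to it, rather than a query being issued against data at rest.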
    • A Cloud Centric Big Data NRT Architecture
      [Architecture diagram: CEP and Interactive Query components, with deployment zones labeled Not Cloud, In Cloud, Not Cloud]
      *Architecture graphic is a modified version of WSO2’s BAM picture
    • Big, Big Data Analytic Architecture Consideration
      • Data Transfer Speed
        • Where is your data? Is it where you will be processing?
        • 1 TB of data takes roughly:
          • 300 hours over a 10 Mbps network
          • 30 hours over a 100 Mbps network
          • 3 hours over a 1 Gbps network
          • 20 minutes over a 10 Gbps network
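The transfer times above can be checked with a back-of-the-envelope calculation. The sketch below assumes 1 TB means 10^12 bytes and introduces a hypothetical `efficiency` parameter for the share of nominal bandwidth actually achieved. At perfect efficiency the times come out somewhat lower than the slide's rounded figures (about 222 hours at 10 Mbps); an assumed efficiency of 0.75 gives about 296 hours, close to the slide's 300. Whether the slide allowed for overhead in this way is a guess.

```python
ONE_TB_BYTES = 10**12  # assumption: decimal terabyte


def transfer_hours(size_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to move `size_bytes` over a link of the given nominal speed.

    `efficiency` (0 < efficiency <= 1) is the assumed fraction of nominal
    bandwidth actually achieved after protocol overhead and contention.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    links = [("10 Mbps", 10e6), ("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]
    for label, bps in links:
        ideal = transfer_hours(ONE_TB_BYTES, bps)
        derated = transfer_hours(ONE_TB_BYTES, bps, efficiency=0.75)
        print(f"{label}: {ideal:.2f} h ideal, {derated:.2f} h at 75% efficiency")
```

Either way the conclusion is the same: below about 1 Gbps, moving a terabyte into the cloud is measured in days, so where the data already lives should drive where it is processed.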
    • Framework For Selecting Approach
    • Summary
      • “approaches for near-real time Business Intelligence and Analytics”
      • “Info. on technologies ranging from Hadoop to Dremel to Event Streaming”
      • “applicability and limitations of these when in the Cloud”
      • “high-level architectures that must be considered will be shared”
      • “entertained, energized, and enlightened”
      • “realistic frame of reference to bring back to their organization”
      • “Journey to the Clouds”
      • “Dumbo can really fly”
    • Feedback Forms
      Please extract from your wallet one of the feedback forms to the right, add any commentary you have in the white space, and hand it to the presenter after the session.
      Thank you for attending! See you in the Clouds!
    • References
      • Big Data Spectrum, Infosys. http://www.infosys.com/cloud/resource-center/Pages/big-data-spectrum.aspx
      • Dremel: Interactive Analysis of Web-Scale Datasets, Melnik et al., Google. http://research.google.com/pubs/pub36632.html
      • DrillProposal, Apache. http://wiki.apache.org/incubator/DrillProposal
      • Storm Rationale. https://github.com/nathanmarz/storm/wiki/Rationale
      • WSO2 BAM, WSO2. http://wso2.com/products/business-activity-monitor/