Big Data Analytics
 


Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved


    Big Data Analytics: Presentation Transcript

    • Big Data Analytics. Peter Sirota, General Manager, Amazon Elastic MapReduce
    • Overview: 1. Introducing Big Data 2. From data to actionable information 3. Analytics and Cloud Computing 4. The Big Data ecosystem
    • 1. Introducing Big Data
    • Generation; Collection & storage; Analytics & computation; Collaboration & sharing
    • The cost of data generation is falling
    • Generation (lower cost, higher throughput); Collection & storage; Analytics & computation; Collaboration & sharing
    • Generation (lower cost, higher throughput); Collection & storage, Analytics & computation, Collaboration & sharing (highly constrained)
    • Data volume: generated data vs. data available for analysis. Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
    • Elastic and highly scalable + No upfront capital expense + Only pay for what you use + Available on-demand = Remove constraints
    • Generation (lower cost, higher throughput); Collection & storage, Analytics & computation, Collaboration & sharing (highly constrained)
    • Generation; Collection & storage, Analytics & computation, Collaboration & sharing (accelerated)
    • Close the gap.
    • Big Data: Technologies and techniques for working productively with data, at any scale.
    • 2. From data to actionable information
    • “Who buys video games?”
    • Per day: 3.5 billion records, 13 TB of click stream logs, 71 million unique cookies
    • Results: 500% return on ad spend, 17,000% reduction in procurement time
    • “Who is using our service?”
    • Finding signal in the noise of logs. Identified early mobile usage. Invested heavily in mobile development.
    • In January 2013, 9,432,061 unique mobile devices used the Yelp mobile app. 4 million+ calls. 5 million+ directions.
    • Open web index. 3.4 billion records. Available to all.
    • Full parse for impact of social networks: 300 lines of Ruby code. 14 hours. $100.
    • Tweeting about Flu. Source: You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011
    • Tweeting about Food: tweets about the price of rice vs. official food price inflation
    • 3. Analytics and Cloud Computing
    • Generation; Collection & storage; Analytics & computation; Collaboration & sharing
    • Generation; Collection & storage (S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase); Analytics & computation; Collaboration & sharing
    • Generation; Collection & storage; Analytics & computation (EC2 & Elastic MapReduce); Collaboration & sharing
    • Generation; Collection & storage; Analytics & computation; Collaboration & sharing (EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift)
    • Generation; Collection & storage (S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase); Analytics & computation (EC2 & Elastic MapReduce); Collaboration & sharing (EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift); AWS Data Pipeline
    • Elastic MapReduce
    • Managed Hadoop analytics
    • Input data (S3, DynamoDB, Redshift)
    • Input data (S3, DynamoDB, Redshift); Code; Elastic MapReduce
    • Input data (S3, DynamoDB, Redshift); Code; Elastic MapReduce; Name node
    • Input data (S3, DynamoDB, Redshift); Code; Elastic MapReduce; Name node; S3/HDFS; Elastic cluster
    • Input data (S3, DynamoDB, Redshift); Code; Elastic MapReduce; Name node; S3/HDFS; Elastic cluster; Queries + BI via JDBC, Pig, Hive
    • Input data (S3, DynamoDB, Redshift); Code; Elastic MapReduce; Name node; S3/HDFS; Elastic cluster; Queries + BI via JDBC, Pig, Hive; Output
    • Input data (S3, DynamoDB, Redshift); Output
    • 1. Elastic clusters
    • 10 hours
    • 6 hours
    • Peak capacity
    • 2. Rapid, tuned provisioning
    • Tedious.
    • Remove undifferentiated heavy lifting.
    • 3. Hadoop all the way down
    • Robust ecosystem. Databases, machine learning, segmentation, clustering, analytics, metadata stores, exchange formats, and so on...
    • 4. Agility for experimentation
    • Instance choice. Stay flexible on instance type & number.
    • 5. Cost optimizations
    • Built for Spot. Name-your-price supercomputing.
    • 1. Elastic clusters 2. Rapid, tuned provisioning 3. Hadoop all the way down 4. Agility for experimentation 5. Cost optimizations
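The Elastic MapReduce slides above describe a flow of input data plus user code into a managed Hadoop cluster that produces output. As a rough illustration of the programming model such a cluster runs, here is a minimal in-memory sketch of the map, shuffle, and reduce phases using word counting. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) are hypothetical helpers for this sketch, not the Hadoop or Elastic MapReduce API, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical helper names for illustration; not the Hadoop or EMR API.
Pair = Tuple[str, int]


def map_phase(records: Iterable[str],
              mapper: Callable[[str], Iterable[Pair]]) -> Iterator[Pair]:
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group values by key, as a framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line: str) -> Iterator[Pair]:
    """Emit (word, 1) for every whitespace-separated word in the line."""
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word: str, counts: List[int]) -> int:
    """Sum the per-occurrence counts for one word."""
    return sum(counts)


def run_job(records: Iterable[str],
            mapper: Callable[[str], Iterable[Pair]],
            reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Chain the three phases over an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


if __name__ == "__main__":
    sample_logs = ["GET /home", "GET /search", "POST /home"]
    print(run_job(sample_logs, word_count_mapper, word_count_reducer))
```

In a real cluster the map and reduce functions would run on many nodes and the shuffle would move data between them; the sketch only shows how the three phases fit together.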
    • Vin Sharma (vin.sharma@intel.com), Director, Product Strategy & Marketing, Big Data Software, Intel Corporation
    • Analysis of Data Can Transform Society: Enhance scientific understanding, drive innovation, and accelerate medical cures. Create new business models and improve organizational processes. Increase public safety and improve energy efficiency with smart grids.
    • Intel’s Vision to Democratize Big Data: Unlock Value in Silicon; Support Open Platforms; Deliver Software Value
    • Intel at the Intersection of Big Data. HPC: Enabling exascale computing on massive data sets. Cloud: Helping enterprises build open interoperable clouds. Open Source: Contributing code and fostering ecosystem.
    • Intel® Technology at the Heart of the Cloud: Server, Storage, Network
    • Scale-Out Big Data Compute Platform Optimization. Cost-effective performance: Intel® Advanced Vector Extensions Technology; Intel® Turbo Boost Technology 2.0; Intel® Advanced Encryption Standard New Instructions Technology
    • Intel® Advanced Vector Extensions Technology: Newest in a long line of processor instruction innovations. Increases floating point operations per clock, up to 2X performance (1). (1) Performance comparison using Linpack benchmark. See backup for configuration details. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more legal information on performance forecasts go to http://www.intel.com/performance
    • Intel® Turbo Boost Technology 2.0: More performance. Higher turbo speeds maximize performance for single and multi-threaded applications.
    • Intel® Advanced Encryption Standard New Instructions: Processor assistance for performing AES encryption (7 new instructions). Makes enabled encryption software faster and stronger.
    • The Power of Intel® Platform Solutions: TeraSort for 1 TB sort, for richer user experiences. [Chart: sort time from 4 hrs down to 10 min, with reductions of 50%, 80%, 50%, and 40%; labels: previous Intel® Xeon® processor, Intel® Xeon® Processor E5 2600, solid-state drive, 10G Ethernet, Intel® Apache Hadoop]
    • The Virtuous Cycle of User Experience: Clients, Cloud, Intelligent Systems
    • 4. The Big Data Ecosystem
    • Data, data, everywhere... Data is stored in silos.
    • S3, HBase on EMR, RDS, DynamoDB, EMR, Redshift, On-premises
    • “How do I get my data to the cloud?”
    • Data mobility: Generated and stored in AWS; Inbound data transfer is free; Multipart upload to S3; Physical media; AWS Direct Connect; Regional replication of AMIs and snapshots
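The data mobility slide above lists multipart upload to S3 as one way to move data into the cloud. The following is a minimal sketch of how a large object might be divided into parts before uploading, assuming the documented S3 limits of a 5 MiB minimum part size (for every part except the last) and at most 10,000 parts per upload. The function `plan_parts` and its default 8 MiB part size are assumptions made for illustration; the sketch only computes byte ranges and does not call any AWS API.

```python
from typing import List, Tuple

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum size for every part except the last
MAX_PARTS = 10_000               # S3 maximum number of parts in one upload


def plan_parts(total_size: int,
               part_size: int = 8 * 1024 * 1024) -> List[Tuple[int, int, int]]:
    """Return (part_number, start_offset, end_offset_exclusive) for each part.

    Hypothetical helper: it plans byte ranges only and uploads nothing.
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")

    # Grow the part size if the object would otherwise need too many parts.
    smallest_allowed = -(-total_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, smallest_allowed)

    parts: List[Tuple[int, int, int]] = []
    start = 0
    number = 1  # part numbers start at 1
    while start < total_size:
        end = min(start + part_size, total_size)
        parts.append((number, start, end))
        start = end
        number += 1
    return parts


if __name__ == "__main__":
    for part in plan_parts(20 * 1024 * 1024):
        print(part)
```

Each planned range could then be uploaded independently and in parallel, which is the main reason multipart upload helps with large transfers.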
    • “How do I integrate my data for maximum impact?”
    • S3, HBase on EMR, RDS, DynamoDB, EMR, Redshift, On-premises
    • AWS Data Pipeline: Orchestration for data-intensive workloads. Announced in November, available now.
    • AWS Data Pipeline: Data-intensive orchestration and automation; Reliable and scheduled; Easy to use, drag and drop; Execution and retry logic; Map data dependencies; Create and manage temporary compute resources
    • Anatomy of a pipeline
    • Additional checks and notifications
    • Arbitrarily complex pipelines
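The AWS Data Pipeline slides above mention mapping data dependencies together with execution and retry logic. Here is a minimal sketch of those two ideas in plain Python: tasks run in dependency order, and a failing task is retried a fixed number of times. The function `run_pipeline`, its parameters, and the task names in the usage example are hypothetical; this is not the AWS Data Pipeline API, and it assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set


def run_pipeline(tasks: Dict[str, Callable[[], None]],
                 depends_on: Dict[str, Set[str]],
                 max_attempts: int = 3) -> List[str]:
    """Run tasks in dependency order, retrying each failed task.

    Hypothetical helper: `tasks` maps a task name to a callable, and
    `depends_on` maps a task name to the names it must wait for.
    Returns the task names in the order they completed.
    """
    for name, prerequisites in depends_on.items():
        unknown = ({name} | set(prerequisites)) - set(tasks)
        if unknown:
            raise ValueError(f"unknown task(s): {sorted(unknown)}")

    graph = {name: depends_on.get(name, set()) for name in tasks}
    completed: List[str] = []
    # static_order() yields every task after all of its prerequisites.
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                break  # success, stop retrying this task
            except Exception as exc:
                if attempt == max_attempts:
                    raise RuntimeError(
                        f"{name} failed after {max_attempts} attempts"
                    ) from exc
        completed.append(name)
    return completed


if __name__ == "__main__":
    steps = {
        "copy_logs": lambda: print("copying logs"),
        "run_job": lambda: print("running job"),
        "load_results": lambda: print("loading results"),
    }
    order = run_pipeline(
        steps,
        {"run_job": {"copy_logs"}, "load_results": {"run_job"}},
    )
    print(order)
```

A managed service would add scheduling, notifications, and provisioning of temporary compute resources on top of this ordering and retry core.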
    • aws.amazon.com/datapipeline
    • aws.amazon.com/big-data
    • Summary: 1. Introducing Big Data 2. From data to actionable information 3. Analytics and Cloud Computing 4. The Big Data ecosystem
    • Get 600 Hours of free supercomputing time! www.powerof60.com
    • Thank you! sirota@amazon.com