Big Data on AWS - AWS Washington D.C. Symposium 2014

Big Data on AWS is a deep dive into Cloud-based big data solutions using Amazon Elastic MapReduce (EMR) and Amazon Redshift. In this session, you will learn how to create big data environments and leverage best practices to design big data environments for security and cost-effectiveness. Demonstrations will include using Amazon EMR to process log data and the ease of provisioning a Redshift data warehouse.

  • When we think of big data, we think of both the proliferation of digital information and the innovations that extract value from that data: increased sales, greater efficiency, better health, analysis, predictions, and recommendations.

    More specifically, we think cloud computing is a fundamental component of any big data strategy because of its inherent benefits.
  • From TBs to PBs, we have the capacity and scale to handle your largest big data workloads
  • You can start and stop on demand, run big data workloads in parallel as you test out new ideas, allowing you to explore without commitments
  • With services such as Auto Scaling and Elastic Load Balancing, you can dial up and down the amount of infrastructure you need for your variable or experimental workloads
  • The total time also includes the wait to get access to those IT resources. With the cloud, you can be up and running in minutes and in parallel.
  • In a sense, the AWS cloud democratizes big data for everyone to use. It is based on two foundational benefits, lower costs and ease of use, and focusing on these key tenets directs how we innovate.

    Lack of constraints leads to new usage models
    Gives control back to individual development teams
    Fail-fast (and fail-cheap) opens up exploratory style
    Many customers create 100s of Amazon EMR clusters per day
    Classic burst-y workload perfect for the cloud
    Big data / HPC clusters themselves are parallelized resources
    Can you build a faster on-premises cluster? Yes, but…
    Usually a shared/contended resource; in cloud, each user/workgroup gets their own cluster
    Cloud is often the fastest platform based on “MTTJC” (Mean Time To Job Completion)
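
The "MTTJC" point can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and every number in it are hypothetical and do not come from the talk. It treats a job's completion time as queue wait plus run time and compares a fast shared cluster, where jobs queue behind each other, with slower per-team clusters that run in parallel.

```python
from statistics import mean


def mean_time_to_job_completion(jobs):
    """Average of (queue wait + run time) over (wait, run) pairs, in minutes."""
    return mean(wait + run for wait, run in jobs)


# Hypothetical numbers. The shared cluster runs each job in 30 minutes,
# but the three jobs execute back to back, so later jobs wait in the queue.
shared_cluster = [(0, 30), (30, 30), (60, 30)]

# Each cloud cluster is slower (40 minutes) and takes 10 minutes to
# provision, but all three start at once because each team has its own.
cloud_clusters = [(10, 40), (10, 40), (10, 40)]

shared_mttjc = mean_time_to_job_completion(shared_cluster)
cloud_mttjc = mean_time_to_job_completion(cloud_clusters)
print(shared_mttjc, cloud_mttjc)
```

With these made-up inputs the shared cluster wins on raw speed yet loses on mean completion time, which is the trade-off the note describes.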
  • We provide all of our services with a self-service API. We also provide managed services, so you don’t have to do the back-end administration, and you can configure your infrastructure with code, scripts, or point and click from our console, all the while maintaining compatibility with your current tools.
  • While I won’t be able to go over all of our big data services, I would like to spend some time introducing several key ones. They are designed for high availability and durability, and they are delivered as managed services where we provision the infrastructure on your behalf, so you can get significant big data storage and analytics with a few clicks or API calls.
  • Amazon S3 is fundamental storage at internet scale. It can store any number of objects, each from 1 byte to 5 TB in size.
    It is engineered for eleven 9s (99.999999999%) of durability, replicating your data at least three times across three distinct physical data centers, which we call Availability Zones.

    Customers such as Dropbox, Spotify, and Pinterest store billions of objects: photos, videos, songs, or any other type of file.
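
To get a feel for what eleven 9s means, here is a small worked calculation. It assumes the durability figure can be read as a per-object, per-year survival probability, which is an interpretation rather than something stated in the talk. The function name and the object count are hypothetical.

```python
from fractions import Fraction


def expected_annual_losses(num_objects, nines=11):
    """Expected objects lost per year, assuming each object independently
    survives a year with probability 1 - 10**-nines (an assumed reading
    of the durability design target)."""
    annual_loss_probability = Fraction(1, 10 ** nines)
    return num_objects * annual_loss_probability


# Hypothetical example: 10,000 stored objects.
losses = expected_annual_losses(10_000)
years_per_lost_object = 1 / losses
print(losses, years_per_lost_object)
```

Exact fractions are used instead of floats because `1 - 0.99999999999` is not represented exactly in binary floating point.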
  • Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources.

    For instance, instead of processing log files in batch, you can have log events stream into Kinesis. Workers using the Kinesis Client Library then read from the stream, process the information, and drive a real-time dashboard.

    Later today, Adi Krishnan, the product manager for Amazon Kinesis, will give a deep dive into the service.
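
The log-streaming flow described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the demo from the talk: the stream name, event fields, and `StatusDashboard` class are hypothetical, and the per-record handler only stands in for what a Kinesis Client Library worker would run. The one real API used is boto3's Kinesis `put_record` call, which takes `StreamName`, `Data`, and `PartitionKey`.

```python
import json
from collections import Counter


def build_put_record_args(stream_name, event):
    """Build keyword arguments for boto3's Kinesis put_record call."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        # Records that share a partition key land on the same shard.
        "PartitionKey": event["host"],
    }


def send_event(stream_name, event):
    """Send one log event to a stream (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs offline

    boto3.client("kinesis").put_record(**build_put_record_args(stream_name, event))


class StatusDashboard:
    """Running count of HTTP status codes, updated once per record."""

    def __init__(self):
        self.counts = Counter()

    def process_record(self, data):
        event = json.loads(data.decode("utf-8"))
        self.counts[event["status"]] += 1


# Offline walk-through: hand the encoded payloads straight to the worker
# instead of sending them through a real stream.
events = [
    {"host": "web-1", "status": 200},
    {"host": "web-2", "status": 500},
    {"host": "web-1", "status": 200},
]
dashboard = StatusDashboard()
for event in events:
    args = build_put_record_args("example-log-stream", event)
    dashboard.process_record(args["Data"])
print(dict(dashboard.counts))
```

In a real deployment the producer would call `send_event` and the counting would run inside stream consumers; only the offline path is exercised here.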
  • DynamoDB is a fast, fully managed NoSQL database service that makes it simple and cost-effective to store and retrieve any amount of data, and serve any level of request traffic. Its guaranteed throughput and single-digit millisecond latency make it a great fit for gaming, ad tech, mobile and many other applications.

    It runs on solid-state drives for high-speed performance at scale. You provision reads and writes for a table without having to worry about the administration of scaling or sharding; that is all done behind the scenes for you.

    For instance, in real-time bidding, three rounds of bidding on which ad to place on a website must finish in less than 200 milliseconds while the page loads. That requires single-digit millisecond latency to determine which ad to place and what price to bid for that ad impression.
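
Provisioning reads and writes comes down to simple arithmetic on capacity units. The sketch below applies the documented sizing rules as I understand them: one read capacity unit covers one strongly consistent read per second (or two eventually consistent reads) of an item up to 4 KB, and one write capacity unit covers one write per second of an item up to 1 KB, with item sizes rounded up. The function names and the workload numbers are hypothetical.

```python
import math


def read_capacity_units(reads_per_second, item_size_kb, strongly_consistent=True):
    """Read capacity units needed; item size rounds up to the next 4 KB."""
    units = reads_per_second * math.ceil(item_size_kb / 4)
    # Eventually consistent reads cost half as much.
    return units if strongly_consistent else math.ceil(units / 2)


def write_capacity_units(writes_per_second, item_size_kb):
    """Write capacity units needed; item size rounds up to the next 1 KB."""
    return writes_per_second * math.ceil(item_size_kb)


# Hypothetical workload: 1,000 reads/s of 6 KB items, 500 writes/s of 1.5 KB items.
provisioned_throughput = {
    "ReadCapacityUnits": read_capacity_units(1000, 6),
    "WriteCapacityUnits": write_capacity_units(500, 1.5),
}
print(provisioned_throughput)
```

The resulting dictionary has the shape of the `ProvisionedThroughput` parameter accepted when creating a table with provisioned capacity.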
  • Provision a petabyte scale cluster to handle complex SQL queries in just a few minutes.

    You can get either an HDD-based cluster or the recently introduced SSD-based cluster, which is smaller in total cluster size but offers higher performance per GB.

    This data warehouse solution costs about a tenth of what traditional solutions of comparable size cost.

    Redshift can drive business intelligence tools such as Jaspersoft or Microstrategy because it supports standard SQL and can connect using ODBC or JDBC drivers.
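
As a rough cost illustration, the snippet below uses the $1K/TB/year figure quoted on the Redshift slide (with a 3-year reservation) and the "about a tenth" comparison from the note above. The 50 TB warehouse size is hypothetical, and the traditional-solution figure is only what that ratio implies, not a quoted price.

```python
REDSHIFT_USD_PER_TB_YEAR = 1_000  # slide figure, with a 3-year reservation


def annual_cost_usd(terabytes, usd_per_tb_year=REDSHIFT_USD_PER_TB_YEAR):
    """Yearly cost for a warehouse of the given size at a flat per-TB rate."""
    return terabytes * usd_per_tb_year


warehouse_tb = 50  # hypothetical warehouse size
redshift_cost = annual_cost_usd(warehouse_tb)
# "About a tenth" implies a comparable traditional solution costs roughly 10x.
implied_traditional_cost = redshift_cost * 10
print(redshift_cost, implied_traditional_cost)
```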
  • When you think of big data these days, Hadoop is almost always an integral part. Combine the benefits of the cloud with the computational paradigm of MapReduce and you get Amazon Elastic MapReduce. Customers have launched millions of clusters to run big data workloads.

    Amazon Elastic MapReduce:
    A key tool in the toolbox to help with ‘Big Data’ challenges
    Makes possible analytics processes previously not feasible
    Cost effective when leveraged with the EC2 Spot market
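
The MapReduce paradigm mentioned above can be shown on a tiny log-analysis job. This is a sketch, not the demo from the session: it counts HTTP status codes, assumes access logs in Apache common log format (status code in the ninth whitespace-separated field), and simulates the map, sort, and reduce phases in a single process. All function names and sample lines are hypothetical.

```python
from itertools import groupby


def map_line(line):
    """Emit (status_code, 1) for one access-log line; skip malformed lines."""
    fields = line.split()
    # Assumption: common log format, so the status code is field 9.
    if len(fields) >= 9 and fields[8].isdigit():
        yield fields[8], 1


def reduce_pairs(sorted_pairs):
    """Sum counts per key; input must already be sorted by key."""
    for key, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


def run_locally(lines):
    """Simulate map -> shuffle/sort -> reduce in one process."""
    pairs = sorted(pair for line in lines for pair in map_line(line))
    return dict(reduce_pairs(pairs))


sample_log = [
    '10.0.0.1 - - [24/Jun/2014:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [24/Jun/2014:10:00:01 +0000] "GET /x HTTP/1.1" 404 128',
    '10.0.0.1 - - [24/Jun/2014:10:00:02 +0000] "GET / HTTP/1.1" 200 512',
]
status_counts = run_locally(sample_log)
print(status_counts)
```

On a cluster, a Hadoop Streaming job would run the map and reduce steps as separate scripts that read lines from standard input and write tab-separated key/value pairs to standard output, with the framework doing the sort in between.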
  • Speaker Notes:

    We have just released “Big Data on AWS”, a new technical training course for individuals who are responsible for implementing big data environments, namely Data Scientists, Data Analysts, and Enterprise Big Data Solution Architects. This course is designed to teach technical end users how to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Pig and Hive. We also cover how to create big data environments, work with Amazon DynamoDB and Amazon Redshift, understand the benefits of Amazon Kinesis, and leverage best practices to design big data environments for security and cost-effectiveness.

    Upcoming classes include:
    April 22 – Redwood City, CA
    May 6 – Sao Paulo, Brazil
    May 20 – Luxembourg
    May 21 – Rio de Janeiro, Brazil
    June 3 – New York, NY, Redwood City, CA, and Columbia, MD
    June 4 – Porto Alegre, Brazil

    Audience
    Individuals responsible for implementing big data environments: Data Scientists, Data Analysts, and Enterprise Big Data Solution Architects
    Objectives
    Understand the architecture of an Amazon EMR cluster
    Choose appropriate AWS data storage options for use with Amazon EMR
    Know your options for ingesting, transferring, and compressing data for use with Amazon EMR
    Use common programming frameworks for Amazon EMR including Hive, Pig, and Streaming
    Work with Amazon Redshift and Spark/Shark to implement big data solutions
    Leverage big data visualization software
    Choose appropriate security and cost management options for Amazon EMR
    Understand the benefits of using Amazon Kinesis for big data
    Prerequisites
    Basic familiarity with big data technologies, including Apache Hadoop and HDFS
    Knowledge of big data technologies such as Pig, Hive, and MapReduce helpful, but not required
    Working knowledge of core AWS services and public cloud implementation
    AWS Essentials course completion or equivalent experience
    Basic understanding of data warehousing, relational database systems, and database design
    Format
    Instructor-Led & Hands-on Labs
    Duration
    3 days
    Details
    aws.amazon.com/training/course-descriptions/bigdata/
  • Microstrategy
    Splunk
    QlikView
    EMR
    Pig
    MongoDB
    Oracle BI, OBIEE 11g
    SAP Hana
    Yellowfin BI
  • AWS is here to help
    Thank you very much for your time today. That concludes this presentation.

Transcript

  • 1. AWS Government, Education, and Nonprofits Symposium Washington, DC | June 24, 2014 - June 26, 2014 AWS Big Data Jon Einkauf jeinkauf@amazon.com
  • 2. Agenda • Brief overview of AWS Big Data services • Demo (Query logs in S3 using Amazon EMR) • Q&A
  • 3. Big Data: Technologies and techniques for working productively with data, at any scale.
  • 4. Big data and AWS • Big data: Potentially massive datasets • Cloud computing: Virtually unlimited capacity
  • 5. Big data and AWS • Big data: Iterative, experimental style of data manipulation and analysis • Cloud computing: Iterative, experimental style of infrastructure deployment/usage
  • 6. Big data and AWS • Big data: Frequently not steady-state workload; peaks and valleys • Cloud computing: At its most efficient with highly variable workloads
  • 7. Big data and AWS • Big data: “Time to results” is critical; shared resources are a bottleneck • Cloud computing: Parallel compute projects allow each workgroup to have more autonomy, get faster results
  • 8. Ease of use • Lower costs
  • 9. Lower costs • Only pay for what you use • No capital investment • Pay as you go
  • 10. Ease of use • Programmable • Integrate with existing tools • Low admin • Easy to configure
  • 11. Use the right tools • Amazon S3 • Amazon Kinesis • Amazon DynamoDB • Amazon Redshift • Amazon Elastic MapReduce • AWS Data Pipeline
  • 12. Amazon S3 • Highly scalable object store • 99.999999999% durability • Encryption • Data lifecycle management
  • 13. Amazon Kinesis • Real-time processing • High throughput • Elastic • Integrates with EMR, S3, Redshift, DynamoDB
  • 14. Amazon DynamoDB • NoSQL database • Seamless scalability • Low admin • Single digit millisecond latency
  • 15. Amazon Redshift • Relational data warehouse • Massively parallel • Petabyte scale • Fully Managed • Low cost ($1K/TB/Year with 3 year Reservation)
  • 16. Amazon Elastic MapReduce (EMR) • Managed Hadoop clusters • MapReduce, Hive, Pig, Impala, HBase, Spark, Accumulo, etc. • Integrates with S3, DynamoDB, Redshift, Data Pipeline, Kinesis
  • 17. AWS Data Pipeline • Data-driven workflows • Integrates with EMR, EC2, S3, Redshift, DynamoDB, SNS • Process and move data between AWS and your own data center
  • 18. Log Analysis Example
  • 19. Demo
  • 20. Big Data on AWS • Brand new course on Big Data • aws.amazon.com/training/course-descriptions/bigdata
  • 21. AWS Big Data Test Drives • APN Partner-provided labs • aws.amazon.com/testdrive/bigdata
  • 22. AWS Training & Events • https://aws.amazon.com/training • Webinars, Bootcamps, and Self-Paced Labs • aws.amazon.com/events
  • 23. Thank you! jeinkauf@amazon.com