Getting Started with Big Data in the Cloud


Find out what others are doing with big data in the cloud and how to get started. We will cover solutions for Hadoop and NoSQL, highlighting the RightScale partner technologies IBM BigInsights, Couchbase, and MongoDB.


Speaker notes:
  • The MapReduce engine consists of one Job Tracker and a Task Tracker on every node. Applications submit jobs to the Job Tracker, which pushes them to the Task Trackers closest to the data. The Job Tracker knows which node each block of data is located on, keeping the work close to the data.
  • Cassandra has no master node and hence no single point of failure.
Transcript:

    1. Getting Started with Big Data in the Cloud. Vijay Tolani, Sr. Sales Engineer
    2. Agenda • What is Big Data and why is it a good fit for the cloud? • Use cases for running Big Data in the cloud • Storing large data sets and unstructured data • Data analytics using Hadoop • RightScale ecosystem solutions: NoSQL, Hadoop analytics • How I learned to use Hadoop in the cloud
    3. What is Big Data? "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." (O'Reilly)
    4. Why is Big Data a Good Fit for the Cloud? "What insight could you gain if you had full use of a 100-node cluster?" "We don't have the resources to do anything like that." "What if one hour of this 100-node cluster cost $34?"
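       The arithmetic behind that figure is worth making explicit. A quick sketch; the per-node rate is an assumption back-solved from the slide's $34/hour claim, not a quoted price:

          # Rough cluster-cost arithmetic. The per-node-hour rate is an
          # assumption inferred from the slide's $34/hour figure.
          nodes = 100
          price_per_node_hour = 0.34  # assumed commodity-instance rate, circa 2012
          print(f"100-node cluster: ${nodes * price_per_node_hour:.2f} per hour")  # $34.00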
    5. Relational Databases... since 1970. Data is stored in tables. Data is accessed via SQL queries.
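       Since the deck contrasts this model with NoSQL later on, here is a minimal illustration of the relational pattern using Python's built-in sqlite3 module; the table and column names are invented for the example:

          # The relational model in miniature: data lives in tables,
          # and you read it back with SQL queries.
          import sqlite3

          conn = sqlite3.connect(":memory:")
          conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
          conn.execute("INSERT INTO users (name) VALUES (?)", ("ada",))
          print(conn.execute("SELECT id, name FROM users").fetchall())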
    6. Now Let Me Tell You a Story
    7. Draw Something Goes Viral. [Chart: daily active users in millions, Feb 6 through Mar 21]
    8. As Usage Grew, So Did Game Data. [Same chart] By March 29, there were over 30,000,000 downloads of the app, over 5,000 drawings being stored per second, over 2,200,000,000 drawings stored, over 105,000 database transactions per second, and over 3.3 terabytes of data stored.
    9. This Isn't the Only Example. Food for thought: • Facebook is expected to have more than 1 billion users by August 2012, handles 40 billion photos, and generates 10 TB of log data per day. • Twitter has more than 100 million users and generates some 7 TB of tweet data per day. • For every trading session, the NYSE captures 1 TB of trade information. Conventional data warehouses and SQL databases do not meet the demands of many of today's applications on 3 key metrics: • Volume • Variety • Velocity
    10. Storing Large Data Sets in the Cloud • "I want to use Hadoop, but I'm out of capacity in my current data warehouse." • If you can't store the data, you can't analyze the data. • Many customers are choosing to begin their Big Data projects by implementing NoSQL databases to store large volumes of data in a variety of formats (structured, unstructured, and semi-structured).
    11. What is NoSQL? • Highly scalable, distributed, and fault tolerant • Designed for use on commodity hardware • Does NOT use SQL • Does NOT guarantee immediate consistency. NoSQL databases are an ideal fit when the following criteria are met: • Simple data models are used • Flexibility is more important than strict control over defined data structures • High performance is a must • Strict data consistency is not required
    12. Types of NoSQL Databases: Key-Value Store, Document Database, Column-Oriented Database
    13. MapReduce. The MapReduce paradigm consists of three steps (see the sketch below): 1. A mapper function or script goes through your input data and outputs a series of keys and values. 2. A sort phase orders the unordered list of keys to ensure all the fragments that have the same key are next to one another in the file. 3. The reducer stage then goes through the sorted output and receives all of the values that have the same key in a contiguous block.
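       The three steps are easiest to see in miniature. Here is a word-count sketch in plain Python, run in-process so the data flow is visible; it mirrors the paradigm, not Hadoop's actual API:

          from itertools import groupby
          from operator import itemgetter

          def mapper(line):
              # Step 1: emit (key, value) pairs for each input record.
              for word in line.split():
                  yield word.lower(), 1

          def reducer(key, values):
              # Step 3: all values for one key arrive together; aggregate them.
              return key, sum(values)

          lines = ["the quick brown fox", "the lazy dog"]
          pairs = [kv for line in lines for kv in mapper(line)]
          pairs.sort(key=itemgetter(0))  # Step 2: the sort/shuffle phase.
          for key, group in groupby(pairs, key=itemgetter(0)):
              print(reducer(key, (v for _, v in group)))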
    14. Hadoop Architecture
    15. Hadoop Concepts
    16. Interacting with Hadoop. Hive: • Program Hadoop jobs using SQL. • Caution: because of Hadoop's focus on large-scale processing, the latency may mean that even simple jobs take minutes to complete, so it is not a substitute for a real-time transactional database. Pig: • A procedural data-processing language designed for Hadoop, in which you specify a series of steps to perform on the data. • Often described as "the duct tape of Big Data" for its general usefulness, and often combined with custom streaming code written in a scripting language for more general operations.
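       For a feel of the Hive side, a hedged sketch using the PyHive client library; the HiveServer2 endpoint on localhost:10000 and the words table are assumptions for illustration, not details from the deck:

          from pyhive import hive

          # The query reads like SQL, but Hive compiles it into MapReduce
          # jobs, which is where the batch-level latency comes from.
          cursor = hive.connect(host="localhost", port=10000).cursor()
          cursor.execute("SELECT word, COUNT(*) AS n FROM words GROUP BY word")
          for word, n in cursor.fetchall():
              print(word, n)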
    17. Key-Value Stores • Use a hash table in which a unique key points to a particular item of data. • Typical application: content caching • Example: Redis
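       A hedged sketch of the caching use case with the redis-py client; the Redis server on localhost:6379 and the key name are assumptions:

          import redis

          r = redis.Redis(host="localhost", port=6379)
          # One unique key pointing to one item of data, with a 5-minute TTL:
          # the typical content-caching pattern.
          r.set("page:/home", "<html>...</html>", ex=300)
          print(r.get("page:/home"))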
    18. Document Databases • Document databases are essentially the next level of key-value stores, allowing nested values associated with each key. • The semi-structured documents are stored in formats such as JSON. • Typical application: web apps • MongoDB and Couchbase offer Hadoop connectors. • Examples: Couchbase, MongoDB
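       A hedged PyMongo sketch of the nested-document idea; the mongod on localhost:27017 and the database, collection, and field names are invented for illustration:

          from pymongo import MongoClient

          client = MongoClient("localhost", 27017)
          drawings = client.drawapp.drawings  # hypothetical database/collection
          # Each document is a JSON-like value; note the nested list of strokes.
          drawings.insert_one({"user": "ada", "strokes": [{"x": 1, "y": 2}]})
          print(drawings.find_one({"user": "ada"}))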
    19. MongoDB Hadoop Integration. Built-in MapReduce: • Built-in MapReduce (JavaScript only) • Limited scalability • One JavaScript implementation at a time. Hadoop Connector: • Integrates MongoDB and Hadoop so data can be read from and written to MongoDB via Hadoop
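       The built-in variant can be sketched from Python, with the slide's caveats baked in: the map and reduce functions are JavaScript strings executed server-side, one at a time. This uses PyMongo 3.x's Collection.map_reduce helper (removed in PyMongo 4, and the server-side command has since been deprecated); the collection and field names are invented:

          from bson.code import Code
          from pymongo import MongoClient

          coll = MongoClient().drawapp.drawings
          # JavaScript only, as the slide notes: both functions run inside mongod.
          mapper = Code("function () { emit(this.user, 1); }")
          reducer = Code("function (key, values) { return Array.sum(values); }")
          result = coll.map_reduce(mapper, reducer, "drawings_per_user")
          for doc in result.find():
              print(doc)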
    20. Column-Oriented Databases • Store and process very large amounts of data distributed over many machines. There are still keys, but they point to multiple columns. • Typical application: distributed file systems • Native Hadoop integration for HBase and Cassandra • Examples: Cassandra, HBase
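       A hedged sketch with the DataStax Python driver for Cassandra; the local node and the pre-created demo keyspace with a users table are assumptions, not details from the deck:

          from cassandra.cluster import Cluster

          session = Cluster(["127.0.0.1"]).connect("demo")
          # Rows are addressed by key but fan out into multiple columns,
          # distributed across the cluster's nodes.
          session.execute(
              "INSERT INTO users (id, name, city) VALUES (%s, %s, %s)",
              (1, "ada", "london"),
          )
          print(session.execute("SELECT * FROM users WHERE id = %s", (1,)).one())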
    21. Cassandra Hadoop Integration • Native support for Apache Pig and Apache Hive • Cassandra's Hadoop support implements the same interface as HDFS to achieve input data locality • One thing Cassandra cannot yet do well is MapReduce itself • MapReduce and related systems such as Pig and Hive work well with HBase because HBase uses Hadoop's HDFS to store its data
    22. My Approach to Learning About Using Hadoop in the Cloud... courtesy of IBM • Learn it: Big Data University • Try it: BigInsights Basic, available for free in the MultiCloud Marketplace • Buy it: BigInsights Enterprise, for advanced functionality
    23. How I Learned to Use Hadoop in the Cloud • Hadoop fundamentals • Hadoop architecture, MapReduce, and HDFS • Using Pig and Hive • Using BigInsights in the cloud with RightScale • The best part: it's free!
    24. BigInsights Basic: Get Started for Free • Available in the MultiCloud Marketplace • Free for data sets up to 10 TB
    25. BigInsights Enterprise
    26. Questions?