Getting Started with Big Data in the Cloud



      Vijay Tolani
      Sr. Sales Engineer





Agenda
   • What is Big Data and Why is it a Good Fit for the Cloud?

   • Use Cases for running Big Data in the Cloud
        • Storing Large Data Sets and Unstructured Data
        • Data Analytics using Hadoop


   • RightScale Ecosystem Solutions
        • NoSQL
        • Hadoop Analytics


   • How I learned to Use Hadoop in the Cloud





What is Big Data?

“Big data is data that exceeds the processing capacity
of conventional database systems. The data is too big,
moves too fast, or doesn't fit the strictures of your
database architectures. To gain value from this data,
you must choose an alternative way to process it.”

      - O’Reilly






Why is Big Data a Good Fit for the Cloud?
   “What insight could you gain if you had full use of a 100-node cluster?”

   “We don’t have the resources to do anything like that.”

   “What if one hour of this 100-node cluster cost $34?”





Relational Databases…since 1970
Data is stored in Tables




Data is accessed via SQL Queries
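
For contrast with what follows, a minimal sketch of this model using Python's built-in sqlite3 module (the table and rows are purely illustrative):

    import sqlite3

    # Data is stored in tables with a fixed schema...
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO users (name, score) VALUES (?, ?)",
                     [("alice", 42), ("bob", 17)])

    # ...and accessed via SQL queries.
    for row in conn.execute("SELECT name, score FROM users WHERE score > 20"):
        print(row)  # -> ('alice', 42)
    conn.close()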







Now Let Me Tell You a Story







Draw Something Goes Viral
   [Chart: Daily Active Users (millions), February 6 through March 21]





As Usage Grew, So Did Game Data
   [Chart: Daily Active Users (millions), February 6 through March 21]

   By March 29, there were over 30,000,000 downloads of the app, over 5,000
   drawings being stored per second, over 2,200,000,000 drawings stored,
   over 105,000 database transactions per second, and over 3.3 terabytes of
   data stored.





This Isn’t The Only Example
Food for Thought:

•   Facebook is expected to have more than 1 billion users by August
    2012, handles 40 billion photos, and generates 10 TB of log data per day.
•   Twitter has more than 100 million users and generates some 7 TB of tweet
    data per day.
•   For every trading session, the NYSE captures 1 TB of trade information.

Conventional Data Warehouses and SQL Databases fail to meet the demands of
many of today’s applications along three key dimensions:

•   Volume
•   Variety
•   Velocity




Storing Large Data Sets in the Cloud

   • “I want to use Hadoop, but I’m out of capacity in my current
     Data Warehouse.”

   • If you can’t store the data, you can’t analyze the data.

   • Many customers are choosing to begin their Big Data projects by
     implementing NoSQL databases to store large volumes of data in a
     variety of formats (structured, unstructured, and semi-structured)






What is NoSQL?
   •   Highly Scalable, Distributed, & Fault Tolerant

   •   Designed for use on Commodity Hardware.

   •   Does NOT use SQL

   •   Does NOT Guarantee Immediate Consistency


   NoSQL databases are an ideal fit when the following criteria are met:

   •   Simple Data Models are used.
   •   Flexibility is more important than strict control over defined Data
       Structures.
   •   High Performance is a must.
   •   Strict Data Consistency is not required.




Types of NoSQL Databases
Key-Value Store




Document Database




Column Oriented Database







MapReduce
   The MapReduce paradigm consists of three steps (sketched in code below):

   1. A mapper function or script goes through your input data and emits a
      series of key-value pairs.
   2. The framework sorts the unordered list of keys so that all fragments
      with the same key sit next to one another.
   3. The reducer stage then goes through the sorted output and receives all
      of the values that share a key as one contiguous block.
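
A minimal sketch of these three steps using Hadoop Streaming, which lets the mapper and reducer be plain Python scripts reading stdin and writing stdout (the file names are illustrative; the sort in step 2 is performed by the framework between the two scripts):

    # mapper.py -- step 1: scan the input and emit key/value pairs
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")  # key <TAB> value

    # reducer.py -- step 3: the framework has already sorted by key (step 2),
    # so all values for a given key arrive as one contiguous block on stdin
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

A job like this would be submitted with the hadoop-streaming jar, passing the two scripts via its -mapper and -reducer options.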







Hadoop Architecture







Hadoop Concepts







Interacting with Hadoop
Hive

•     Program Hadoop jobs using SQL (Hive’s SQL dialect, HiveQL); see the sketch below.
•     Caution: Because of Hadoop’s focus on large-scale processing, the latency may mean
      that even simple jobs take minutes to complete, so it’s not a substitute for a real-time
      transactional database.
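
For illustration, a minimal sketch of submitting a HiveQL query from Python via the Hive CLI's -e option (the words table and its columns are hypothetical):

    import subprocess

    # HiveQL looks like SQL but compiles down to MapReduce jobs, so even
    # this simple aggregation may take minutes to complete on a cluster.
    query = "SELECT word, COUNT(*) AS n FROM words GROUP BY word ORDER BY n DESC LIMIT 10"
    result = subprocess.run(["hive", "-e", query],
                            capture_output=True, text=True, check=True)
    print(result.stdout)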

Pig

•     Procedural data processing language designed for Hadoop where you specify a series
      of steps to perform on the data.
•     Often described as “the duct tape of Big Data” because it is so useful for quick,
      ad-hoc processing; it is often combined with custom streaming code written in a
      scripting language for more general operations (see the sketch below).
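
And a sketch of the kind of custom streaming script Pig can call out to (for example via its STREAM operator): a plain stdin-to-stdout filter over tab-separated records. The field layout here is illustrative:

    # clean.py -- Pig pipes tuples in on stdin (tab-separated) and reads
    # the transformed tuples back from this script's stdout
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:               # drop malformed records
            print(f"{fields[0].lower()}\t{fields[1]}")   # normalize the key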







Key-Value Stores
• Use a hash table where there is a unique key and a pointer to a
  particular item of data.

• Typical Application: Content Caching

• Example: Redis
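
A minimal content-caching sketch, assuming the redis-py client and a Redis server on localhost:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # The entire data model is a hash table: one unique key, one value.
    r.set("page:/home", "<html>...</html>", ex=60)  # cache with a 60-second TTL

    cached = r.get("page:/home")  # -> b'<html>...</html>', or None after expiry
    if cached is None:
        pass  # on a miss, regenerate the page and r.set() it again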







Document Databases
• Document databases are essentially the next level of Key-Value
  stores, allowing nested values associated with each key.
• The semi-structured documents are stored in formats such as
  JSON.

• Typical Applications: Web Apps

• Hadoop Connectors are available for MongoDB and Couchbase

• Example: Couchbase, MongoDB
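
A minimal sketch with pymongo showing the nested values a document store allows; the database, collection, and fields are illustrative (loosely modeled on the Draw Something example above):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    drawings = client.drawsomething.drawings  # illustrative database/collection

    # Unlike a key-value store, the value is a semi-structured (JSON-like)
    # document with nested fields the database can query into.
    drawings.insert_one({
        "user": "alice",
        "word": "giraffe",
        "strokes": [{"color": "#ffcc00", "points": [[0, 0], [10, 12]]}],
    })
    print(drawings.find_one({"user": "alice"})["word"])  # -> giraffe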







MongoDB Hadoop Integration

Built-in MapReduce
• JavaScript only
• Limited scalability
• One JavaScript implementation runs at a time

Hadoop Connector
• Integrating MongoDB and Hadoop to Read/Write data to/from MongoDB
  via Hadoop
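
To make the “JavaScript only” point concrete, a sketch of the built-in approach driven from Python; the map and reduce functions themselves must be JavaScript. This assumes an older MongoDB/pymongo where the map_reduce helper is still available:

    from bson.code import Code
    from pymongo import MongoClient

    coll = MongoClient().drawsomething.drawings  # illustrative collection

    # The map and reduce functions run in the server's JavaScript engine,
    # and only one implementation executes at a time -- the scalability
    # limit noted above.
    mapper = Code("function () { emit(this.word, 1); }")
    reducer = Code("function (key, values) { return Array.sum(values); }")

    out = coll.map_reduce(mapper, reducer, "word_counts")
    for doc in out.find():
        print(doc)  # e.g. {'_id': 'giraffe', 'value': 2.0}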







Column Oriented Database
• Store and process very large amounts of data distributed over
  many machines. There are still keys but they point to multiple
  columns.

• Typical Application: Distributed File Systems

• Native Hadoop integration for HBase and Cassandra

• Example: Cassandra, HBase
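
A minimal sketch with the DataStax Python driver, where each row key points at many columns; the keyspace and table are illustrative:

    from cassandra.cluster import Cluster

    # Connect to one node; the driver discovers the rest of the ring.
    session = Cluster(["127.0.0.1"]).connect("demo")  # assumes a 'demo' keyspace

    session.execute("""
        CREATE TABLE IF NOT EXISTS drawings_by_user (
            user text, drawn_at timestamp, word text, size_bytes int,
            PRIMARY KEY (user, drawn_at))
    """)
    session.execute(
        "INSERT INTO drawings_by_user (user, drawn_at, word, size_bytes) "
        "VALUES (%s, toTimestamp(now()), %s, %s)",
        ("alice", "giraffe", 2048))
    rows = session.execute(
        "SELECT word FROM drawings_by_user WHERE user = %s", ("alice",))
    for row in rows:
        print(row.word)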







Cassandra Hadoop Integration
•   Native Support for Apache Pig and Apache Hive
•   Cassandra's Hadoop support implements the same interface as HDFS to achieve input data locality




•   One thing Cassandra can’t do well yet, however, is MapReduce itself.
•   MapReduce and related systems such as Pig and Hive work well with HBase because
    HBase uses Hadoop’s HDFS to store its data.






My Approach to Learning About Using Hadoop in the Cloud… Courtesy of IBM

• Learn It
      • Big Data University


• Try It
      • BigInsights Basic, Available for Free in the MultiCloud MarketPlace


• Buy It
      • BigInsights Enterprise for Advanced Functionality







How I Learned to Use Hadoop in the Cloud
   • Hadoop Fundamentals
        • Hadoop Architecture, MapReduce, and HDFS
        • Using Pig and Hive
   • Using BigInsights in the Cloud with RightScale
   • The Best Part – It’s Free!!
   • http://www.bigdatauniversity.com/







BigInsights Basic – Get Started for Free

   • Available in the MultiCloud MarketPlace

   • Free for Data Sets up to 10 TB







BigInsights Enterprise




Questions?






Editor's Notes

  • #15 — The MapReduce engine consists of one JobTracker and a TaskTracker assigned to every node. Applications submit jobs to the JobTracker, which pushes them to the TaskTrackers closest to the data; the JobTracker knows which node the data is located on, keeping the work close to the data.
  • #21 — Cassandra has no master node and, hence, no single point of failure.