Azure Storage options, together with best practices and methods for handling large amounts of data
Slides and recording can be found on my blog: http://blogs.microsoft.co.il/iblogger/2014/05/22/slides-from-where-is-my-data-in-the-cloud-webinar-19052014/
2. About Me
• Software architect, consultant and instructor
• Software Engineering Lecturer @ Ruppin Academic Center
• Technology addict
• 10 years of experience
• .NET and Native Windows Programming
@tamir_dresher
tamirdr@codevalue.net
http://www.TamirDresher.com
8. Windows Azure's Growing Global Presence
Datacenters: North America, Europe, Asia Pacific
Storage SLA – 99.99% (at most 52.56 minutes of downtime per year)
http://azure.microsoft.com/en-us/support/legal/sla
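The downtime figure follows directly from the SLA percentage; a quick sanity check of the arithmetic (assuming a 365-day year; the `SlaMath` helper is illustrative, not from the talk):

```csharp
using System;

public static class SlaMath
{
    // Minutes of allowed downtime per year for a given availability SLA,
    // assuming a 365-day year (365 * 24 * 60 = 525,600 minutes).
    public static double DowntimeMinutesPerYear(double availability)
        => (1.0 - availability) * 365 * 24 * 60;

    // Example: DowntimeMinutesPerYear(0.9999) ≈ 52.56
}
```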
10. What is a BLOB
• BLOB – Binary Large OBject
• Storage for any type of entity, such as binary files and text documents
• Distributed File Service (DFS)
– Scalability and High availability
• A BLOB is distributed across multiple servers and replicated at least three times
Where is my data BLOB
14. BLOBS
• Block blob - up to 200 GB in size
• Page blobs – up to 1 TB in size
• Total Account Capacity - 500 TB
• Pricing
– Storage capacity used
– Replication option (LRS, GRS, RA-GRS)
– Number of requests
– Data egress
– http://azure.microsoft.com/en-us/pricing/details/storage/
16. SQL Azure
• SQL Server in the cloud
• No administrative overhead
• High Availability
• Pay-as-you-grow pricing
• Familiar Development Model*
* Despite missing features and some limitations - http://msdn.microsoft.com/en-us/library/ff394115.aspx
Where is my data SQL Azure
18. SQL Azure – Pricing
19. Case Study - https://haveibeenpwned.com/
20. Case Study - https://haveibeenpwned.com/
• http://www.troyhunt.com/2013/12/working-with-154-million-records-on.html
• How do I make querying 154 million email addresses as fast as possible?
• If I want 100 GB of SQL Server and I want to hit it 10 million times, it'll cost me $176 a month (now it's ~$20)
23. Table Storage
• Not an RDBMS
– No relationships between entities
– NoSQL
• An entity can have up to 255 properties, up to 1 MB per entity
• Mandatory properties for every entity
– PartitionKey & RowKey (the only indexed properties)
• Together they uniquely identify an entity
• The same RowKey can be used under different PartitionKeys
• They define the sort order
– Timestamp – optimistic concurrency
Where is my data Tables
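Timestamp-based optimistic concurrency means an update only succeeds if the entity has not changed since it was read. A minimal in-memory sketch of the idea (`VersionedStore` is a hypothetical illustration of the mechanism, not the actual storage client):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory store illustrating ETag-style optimistic concurrency.
public class VersionedStore
{
    private readonly Dictionary<string, (string Value, int ETag)> _rows =
        new Dictionary<string, (string Value, int ETag)>();

    public (string Value, int ETag) Get(string key) => _rows[key];

    public void Insert(string key, string value) => _rows[key] = (value, 0);

    // The update succeeds only if the caller's ETag still matches the stored
    // one, i.e. nobody else modified the row since it was read.
    public bool TryUpdate(string key, string newValue, int etag)
    {
        if (_rows[key].ETag != etag) return false;   // concurrent change detected
        _rows[key] = (newValue, etag + 1);
        return true;
    }
}
```

Two clients reading the same row get the same ETag; whichever writes second sees a mismatch and must re-read, which is how the real table service surfaces a 412 Precondition Failed.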
25. Table Object Model
• The ITableEntity interface defines the PartitionKey, RowKey, Timestamp, and ETag properties
– Implemented by TableEntity and DynamicTableEntity
// This class defines one additional property of integer type.
// Since it derives from TableEntity, it will be automatically
// serialized and deserialized.
public class SampleEntity : TableEntity
{
    public int SampleProperty { get; set; }
}
26. Sample – Inserting an Entity into a Table
// You will need the following using statements
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;
// Create the table client.
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable peopleTable = tableClient.GetTableReference("people");
peopleTable.CreateIfNotExists();
// Create a new customer entity. CustomerEntity derives from TableEntity;
// in this sample the last name is the PartitionKey and the first name the RowKey.
CustomerEntity customer1 = new CustomerEntity("Harp", "Walter");
customer1.Email = "Walter@contoso.com";
customer1.PhoneNumber = "425-555-0101";
// Create an operation to add the new customer to the people table.
TableOperation insertCustomer1 = TableOperation.Insert(customer1);
// Submit the operation to the table service.
peopleTable.Execute(insertCustomer1);
27. Retrieve
// Create the table client.
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable peopleTable = tableClient.GetTableReference("people");
// Retrieve the entity with partition key of "Smith" and row key of "Jeff"
TableOperation retrieveJeffSmith =
TableOperation.Retrieve<CustomerEntity>("Smith", "Jeff");
// Retrieve entity
CustomerEntity specificEntity =
(CustomerEntity)peopleTable.Execute(retrieveJeffSmith).Result;
28. Table Storage – Important Points
• Azure Tables can store TBs of data
• Tables Operations are fast
• Tables are distributed – the PartitionKey defines the partition
– A table might be stored in different partitions on different storage devices.
30. Case Study - https://haveibeenpwned.com/
31. Case Study - https://haveibeenpwned.com/
• How do I make querying 154 million email addresses as fast as possible?
• foo@bar.com – the domain is the partition key and the alias is the row key
• If I want 100 GB of storage and I want to hit it 10 million times, it'll cost me $8 a month
• SQL Server will cost $176 a month – 22 times more expensive
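The partition/row key design above is a one-line split; a sketch in C# (the `EmailKeys` helper is mine, not from the talk):

```csharp
using System;

public static class EmailKeys
{
    // Split an email address into the (PartitionKey, RowKey) pair used by
    // the haveibeenpwned.com design: domain = partition, alias = row.
    public static (string PartitionKey, string RowKey) ForEmail(string email)
    {
        var at = email.IndexOf('@');
        if (at < 0) throw new ArgumentException("not an email address", nameof(email));
        return (email.Substring(at + 1), email.Substring(0, at));
    }
}
```

ForEmail("foo@bar.com") yields ("bar.com", "foo"), so all addresses on one domain land in one partition, and looking up a specific address is a direct point query on (PartitionKey, RowKey).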
33. Hadoop in the cloud
• Hadoop on the Azure cloud
• Some facts:
– Bing ingests > 7 petabytes a month
– The Twitter community generates over 1 terabyte of tweets every day
– Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes
Where is my data HDInsight
Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
34. MapReduce – The BigData Power
• Map – takes an input record and outputs key/value pairs
(Key1, Value1)
(Key2, Value2)
…
(Keyn, Valuen)
35. MapReduce – The BigData Power
• Reduce – takes the group of values per key and produces a new group of values
Key1: [value1-1, value1-2, …] → [new_value1-1, new_value1-2, …]
Key2: [value2-1, value2-2, …] → [new_value2-1, new_value2-2, …]
…
Keyn: [valueN-1, valueN-2, …] → [new_valueN-1, new_valueN-2, …]
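The two phases compose: map each input record to key/value pairs, group by key, then reduce each group. A minimal in-memory word-count sketch of that flow (plain LINQ, no Hadoop involved; all names here are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MiniMapReduce
{
    // Map: one line of text -> (word, 1) pairs.
    public static IEnumerable<(string Key, int Value)> Map(string line)
        => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
               .Select(word => (word, 1));

    // Reduce: one key and all its values -> a single aggregated value.
    public static (string Key, int Value) Reduce(string key, IEnumerable<int> values)
        => (key, values.Sum());

    // The shuffle/sort step is the GroupBy in the middle.
    public static Dictionary<string, int> Run(IEnumerable<string> lines)
        => lines.SelectMany(Map)
                .GroupBy(p => p.Key, p => p.Value)
                .Select(g => Reduce(g.Key, g))
                .ToDictionary(p => p.Key, p => p.Value);
}
```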
37. So How Does It Work?
38. Finding common friends
• Facebook shows you how many common friends you have with someone
• As of 01.01.2014, Facebook had 1,310,000,000 active users with 130 friends on average
• Let's calculate the mutual friends
39. Finding common friends
• We can represent a friend relationship as:
Someone → [List of his/her friends]
• Note that a friend relationship is symmetrical
– if A is a friend of B, then B is a friend of A
Where is my data HDInsight Common Friends
40. Example of Friends file
• U1 -> U2 U3 U4
• U2 -> U1 U3 U4 U5
• U3 -> U1 U2 U4 U5
• U4 -> U1 U2 U3 U5
• U5 -> U2 U3 U4
41. Designing our MapReduce job
• Each line of the file will be an input line to the Mapper
• The Mapper will output key-value pairs
• Key: (user, friend)
– Sorted, so the friend might come before the user
• Value: the user's list of friends
42. Designing our MapReduce job - Mapper
• Each line of the file will be an input line to the Mapper
• The Mapper will output key-value pairs
• Key: (user, friend)
– Sorted, so the friend might come before the user
• Value: the user's list of friends
• Having the key sorted helps the reducer: identical pairs will be provided together
43. Mapper Example
Given the line: U1 -> U2 U3 U4
Mapper output:
(U1 U2) -> U2 U3 U4
(U1 U3) -> U2 U3 U4
(U1 U4) -> U2 U3 U4
44. Mapper Example
Given the line: U1 -> U2 U3 U4
Mapper output:
(U1 U2) -> U2 U3 U4
(U1 U3) -> U2 U3 U4
(U1 U4) -> U2 U3 U4

Given the line: U2 -> U1 U3 U4 U5
Mapper output:
(U1 U2) -> U1 U3 U4 U5
(U2 U3) -> U1 U3 U4 U5
(U2 U4) -> U1 U3 U4 U5
(U2 U5) -> U1 U3 U4 U5
45. Mapper Example – final result
Given the line: U1 -> U2 U3 U4
Mapper output:
(U1 U2) -> U2 U3 U4
(U1 U3) -> U2 U3 U4
(U1 U4) -> U2 U3 U4

Given the line: U2 -> U1 U3 U4 U5
Mapper output:
(U1 U2) -> U1 U3 U4 U5
(U2 U3) -> U1 U3 U4 U5
(U2 U4) -> U1 U3 U4 U5
(U2 U5) -> U1 U3 U4 U5

Given the line: U3 -> U1 U2 U4 U5
Mapper output:
(U1 U3) -> U1 U2 U4 U5
(U2 U3) -> U1 U2 U4 U5
(U3 U4) -> U1 U2 U4 U5
(U3 U5) -> U1 U2 U4 U5

Given the line: U4 -> U1 U2 U3 U5
Mapper output:
(U1 U4) -> U1 U2 U3 U5
(U2 U4) -> U1 U2 U3 U5
(U3 U4) -> U1 U2 U3 U5
(U4 U5) -> U1 U2 U3 U5

Given the line: U5 -> U2 U3 U4
Mapper output:
(U2 U5) -> U2 U3 U4
(U3 U5) -> U2 U3 U4
(U4 U5) -> U2 U3 U4
46. Designing our MapReduce job - Reducer
• The input for the reducer will be structured as:
(friend1, friend2) (friend1's friends) (friend2's friends)
• The reducer will find the intersection between the two lists
• Output:
(friend1, friend2) (intersection of friend1's and friend2's friends)
47. Reducer Example
Given the line => Reducer output:
(U1 U2) -> (U1 U3 U4 U5) (U2 U3 U4)  =>  (U1 U2) -> (U3 U4)
(U1 U3) -> (U1 U2 U4 U5) (U2 U3 U4)  =>  (U1 U3) -> (U2 U4)
(U1 U4) -> (U1 U2 U3 U5) (U2 U3 U4)  =>  (U1 U4) -> (U2 U3)
(U2 U3) -> (U1 U2 U4 U5) (U1 U3 U4 U5)  =>  (U2 U3) -> (U1 U4 U5)
(U2 U4) -> (U1 U2 U3 U5) (U1 U3 U4 U5)  =>  (U2 U4) -> (U1 U3 U5)
(U2 U5) -> (U1 U3 U4 U5) (U2 U3 U4)  =>  (U2 U5) -> (U3 U4)
(U3 U4) -> (U1 U2 U3 U5) (U1 U2 U4 U5)  =>  (U3 U4) -> (U1 U2 U5)
(U3 U5) -> (U1 U2 U4 U5) (U2 U3 U4)  =>  (U3 U5) -> (U2 U4)
(U4 U5) -> (U1 U2 U3 U5) (U2 U3 U4)  =>  (U4 U5) -> (U2 U3)
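The whole worked example can be replayed in memory; this sketch mirrors the mapper and reducer logic using plain LINQ (no Hadoop cluster needed, and the `CommonFriends` class is my illustration, not the talk's code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CommonFriends
{
    // Map: "U1 U2 U3 U4" -> ("U1 U2", "U2 U3 U4"), ("U1 U3", "U2 U3 U4"), ...
    // The pair key is sorted so (U1,U2) and (U2,U1) meet at the same reducer.
    public static IEnumerable<(string Key, string Friends)> Map(string line)
    {
        var parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var user = parts[0];
        var friends = parts.Skip(1).ToArray();
        foreach (var friend in friends)
        {
            var key = new[] { user, friend };
            Array.Sort(key);
            yield return (string.Join(" ", key), string.Join(" ", friends));
        }
    }

    // Shuffle by key, then reduce: intersect the friend lists that
    // arrive for each pair key.
    public static Dictionary<string, string> Run(IEnumerable<string> lines)
        => lines.SelectMany(Map)
                .GroupBy(p => p.Key, p => p.Friends)
                .ToDictionary(
                    g => g.Key,
                    g => string.Join(" ", g.Select(f => f.Split(' '))
                                           .Aggregate((a, b) => a.Intersect(b).ToArray())));
}
```

Running it on the five-line friends file above reproduces the reducer output table, e.g. the pair (U1 U2) maps to "U3 U4".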
49. Creating the C# MapReduce – Mapper
public class CommonFriendsMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        var strings = inputLine.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (strings.Any())
        {
            var currentUser = strings[0];
            var friends = strings.Skip(1).ToArray();
            foreach (var friend in friends)
            {
                // Sort the (user, friend) pair so that both sides of the
                // friendship emit the same key.
                var keyArr = new[] { currentUser, friend };
                Array.Sort(keyArr);
                var key = String.Join(" ", keyArr);
                context.EmitKeyValue(key, string.Join(" ", friends));
            }
        }
    }
}
50. Creating the C# MapReduce – Reducer
public class CommonFriendsReducer : ReducerCombinerBase
{
    public override void Reduce(string key,
                                IEnumerable<string> strings,
                                ReducerCombinerContext context)
    {
        // Each pair key receives exactly two friend lists,
        // one from each side of the symmetrical friendship.
        var friendsLists = strings
            .Select(friendList => friendList.Split(' '))
            .ToList();
        var intersection = friendsLists[0].Intersect(friendsLists[1]);
        context.EmitKeyValue(key, string.Join(" ", intersection));
    }
}
51. Creating the C# MapReduce – Hadoop Job
HadoopJobConfiguration myConfig = new HadoopJobConfiguration();
myConfig.InputPath = "wasb:///example/data/friends/friends";
myConfig.OutputFolder = "wasb:///example/data/friends/output";

Environment.SetEnvironmentVariable("HADOOP_HOME", @"c:\hadoop");
Environment.SetEnvironmentVariable("JAVA_HOME", @"c:\hadoop\jvm");

var hadoop = Hadoop.Connect(clusterUri,
                            clusterUserName,
                            hadoopUserName,
                            clusterPassword,
                            azureStorageAccount,
                            azureStorageKey,
                            azureStorageContainer,
                            createContainerIfNotExist);

var jobResult =
    hadoop.MapReduceJob.Execute<CommonFriendsMapper, CommonFriendsReducer>(myConfig);

int exitCode = jobResult.Info.ExitCode; // 0 – success, otherwise – failure
52. Pricing
A 10-node cluster that will exist for 24 hours:
• Secure gateway node – free
• Head node – 15.36 USD per 24-hour day
• 1 data node – 7.68 USD per 24-hour day
• 10 data nodes – 76.80 USD per 24-hour day
• Total: 92.16 USD per 24-hour day
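The total is just the per-node rates added up; a quick check of the arithmetic (the `ClusterCost` helper is illustrative):

```csharp
using System;

public static class ClusterCost
{
    // Daily cost of an HDInsight cluster at the quoted per-day rates:
    // gateway node free, one head node, N data nodes.
    public static decimal PerDay(int dataNodes)
        => 15.36m + dataNodes * 7.68m;

    // Example: PerDay(10) == 92.16m
}
```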
54. Comparing the alternatives

Storage Type | When should you use it | Implications
BLOB | Unstructured data; files | Application logic responsibility; consider using HDInsight (Hadoop)
SQL Server | Structured relational data; ACID transactions; max 150 GB (500 GB in preview) | SQL DML+DDL; could affect scalability; BI abilities; reporting
Azure Tables | Structured data; loose schema; geo-replication (high DR); auto sharding | OData, REST; application logic responsibility (multiple schemas)
Where is my data Wrap Up
55. What have we seen
• Azure Blobs
• Azure Tables
• Azure SQL Server
• HDInsight
56. What’s Next
• NoSql – MongoDB, Cassandra, CouchDB, RavenDB
• Hadoop ecosystem – Hive, Pig, Sqoop, Mahout
• http://blogs.msdn.com/b/windowsazure/
• http://blogs.msdn.com/b/windowsazurestorage/
• http://blogs.msdn.com/b/bigdatasupport/
Slide Objectives
Understand the hierarchy of Blob storage
Speaker Notes
The Blob service provides storage for entities, such as binary files and text files.
The REST API for the Blob service exposes two resources:
Containers
Blobs.
A container is a set of blobs; every blob must belong to a container.
The Blob service defines two types of blobs:
Block blobs, which are optimized for streaming.
Page blobs, which are optimized for random read/write operations and which provide the ability to write to a range of bytes in a blob.
Blobs can be read by calling the Get Blob operation. A client may read the entire blob, or an arbitrary range of bytes.
Block blobs less than or equal to 64 MB in size can be uploaded by calling the Put Blob operation.
Block blobs larger than 64 MB must be uploaded as a set of blocks, each of which must be less than or equal to 4 MB in size.
Page blobs are created and initialized with a maximum size with a call to Put Blob.
To write content to a page blob, you call the Put Page operation. The maximum size currently supported for a page blob is 1 TB.
Notes
http://msdn.microsoft.com/en-us/library/dd573356.aspx
Using the REST API for the Blob service, developers can create a hierarchical namespace similar to a file system. Blob names may encode a hierarchy by using a configurable path separator. For example, the blob names MyGroup/MyBlob1 and MyGroup/MyBlob2 imply a virtual level of organization for blobs. The enumeration operation for blobs supports traversing the virtual hierarchy in a manner similar to that of a file system, so that you can return a set of blobs that are organized beneath a group. For example, you can enumerate all blobs organized under MyGroup/.
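Because the hierarchy is purely a naming convention, listing "under a directory" reduces to a prefix filter over flat blob names. An illustrative in-memory sketch of that idea, using the MyGroup examples from the text (not the actual client API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BlobHierarchy
{
    // Blob storage is flat; "directories" exist only as name prefixes with a
    // path separator. Enumerating under a prefix is a simple filter.
    public static IEnumerable<string> ListUnder(IEnumerable<string> blobNames, string prefix)
        => blobNames.Where(n => n.StartsWith(prefix, StringComparison.Ordinal));
}
```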
Put Blob - Creates a new blob or replaces an existing blob within a container.
Get Blob - Reads or downloads a blob from the system, including its metadata and properties.
Delete Blob - Deletes a blob
Copy Blob - Copies a source blob to a destination blob within the same storage account.
SnapShot Blob - The Snapshot Blob operation creates a read-only snapshot of a blob.
Lease Blob - Establishes an exclusive one-minute write lock on a blob. To write to a locked blob, a client must provide a lease ID.
Notes
The Blob service defines two types of blobs:
Block blobs, which are optimized for streaming. This type of blob was the only blob type available with versions prior to 2009-09-19.
Page blobs, which are optimized for random read/write operations and provide the ability to write to a range of bytes in a blob. Page blobs are available only with version 2009-09-19.
Containers and blobs support user-defined metadata in the form of name-value pairs specified as headers on a request operation.
A block blob may be created in one of two ways. Block blobs less than or equal to 64 MB in size can be uploaded by calling the Put Blob operation. Block blobs larger than 64 MB must be uploaded as a set of blocks, each of which must be less than or equal to 4 MB in size. A set of successfully uploaded blocks can be assembled in a specified order into a single contiguous blob by calling Put Block List. The maximum size currently supported for a block blob is 200 GB.
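The 64 MB and 4 MB limits imply a simple block-count calculation; an illustrative sketch (ceiling division over the blob size; `BlockMath` is my naming):

```csharp
using System;

public static class BlockMath
{
    private const long MaxSingleUpload = 64L * 1024 * 1024;  // 64 MB Put Blob limit
    private const long MaxBlockSize    = 4L  * 1024 * 1024;  // 4 MB per block

    // Number of upload operations needed for a block blob of the given size:
    // one Put Blob at or under 64 MB, otherwise one Put Block per 4 MB chunk
    // (followed by a Put Block List to assemble them).
    public static long BlocksNeeded(long blobSize)
        => blobSize <= MaxSingleUpload
            ? 1
            : (blobSize + MaxBlockSize - 1) / MaxBlockSize;
}
```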
Blobs support conditional update operations that may be useful for concurrency control and efficient uploading.
For the Blob service API reference, see Blob Service API.