Where is my Data? (In the Cloud) – Tamir Dresher

Azure storage options, together with best practices and methods for handling large amounts of data.

Slides and recording can be found on my blog: http://blogs.microsoft.co.il/iblogger/2014/05/22/slides-from-where-is-my-data-in-the-cloud-webinar-19052014/

Slide Notes
  • Slide Objectives
    Understand the hierarchy of Blob storage

    Speaker Notes
    The Blob service provides storage for entities, such as binary files and text files.
    The REST API for the Blob service exposes two resources:
    Containers
    Blobs.
    A container is a set of blobs; every blob must belong to a container.
    The Blob service defines two types of blobs:
    Block blobs, which are optimized for streaming.
    Page blobs, which are optimized for random read/write operations and which provide the ability to write to a range of bytes in a blob.

    Blobs can be read by calling the Get Blob operation. A client may read the entire blob, or an arbitrary range of bytes.

    Block blobs less than or equal to 64 MB in size can be uploaded by calling the Put Blob operation.
    Block blobs larger than 64 MB must be uploaded as a set of blocks, each of which must be less than or equal to 4 MB in size.
    Page blobs are created and initialized with a maximum size with a call to Put Blob.
    To write content to a page blob, you call the Put Page operation. The maximum size currently supported for a page blob is 1 TB.

    Notes
    http://msdn.microsoft.com/en-us/library/dd573356.aspx
    Using the REST API for the Blob service, developers can create a hierarchical namespace similar to a file system. Blob names may encode a hierarchy by using a configurable path separator. For example, the blob names MyGroup/MyBlob1 and MyGroup/MyBlob2 imply a virtual level of organization for blobs. The enumeration operation for blobs supports traversing the virtual hierarchy in a manner similar to that of a file system, so that you can return a set of blobs that are organized beneath a group. For example, you can enumerate all blobs organized under MyGroup/.
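
    As a concrete illustration (not from the original deck), here is a minimal C# sketch of Put Blob / Get Blob using the Microsoft.WindowsAzure.Storage client library that the later table samples also use; the connection string, container name, and blob name are placeholders:

    using System;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    // Parse a storage connection string (placeholder credentials).
    CloudStorageAccount account = CloudStorageAccount.Parse(
        "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=...");
    CloudBlobClient blobClient = account.CreateCloudBlobClient();

    // Every blob must belong to a container.
    CloudBlobContainer container = blobClient.GetContainerReference("demo");
    container.CreateIfNotExists();

    // Put Blob: a small block blob goes up in a single call; the client
    // library splits larger uploads into blocks behind the scenes.
    CloudBlockBlob blob = container.GetBlockBlobReference("MyGroup/MyBlob1");
    blob.UploadText("hello blob");

    // Get Blob: read the whole blob back (an arbitrary byte range is also possible).
    Console.WriteLine(blob.DownloadText());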
  • Put Blob - Creates a new blob or replaces an existing blob within a container.
    Get Blob - Reads or downloads a blob from the system, including its metadata and properties.
    Delete Blob - Deletes a blob.
    Copy Blob - Copies a source blob to a destination blob within the same storage account.
    Snapshot Blob - Creates a read-only snapshot of a blob.
    Lease Blob - Establishes an exclusive one-minute write lock on a blob. To write to a locked blob, a client must provide a lease ID.

    Using the REST API for the Blob service, developers can create a hierarchical namespace similar to a file system.
    Blob names may encode a hierarchy by using a configurable path separator. For example, the blob names MyGroup/MyBlob1 and MyGroup/MyBlob2 imply a virtual level of organization for blobs.
    The enumeration operation for blobs supports traversing the virtual hierarchy in a manner similar to that of a file system, so that you can return a set of blobs that are organized beneath a group.
    For example, you can enumerate all blobs organized under MyGroup/.

    Notes
    The Blob service provides storage for entities, such as binary files and text files. The REST API for the Blob service exposes two resources: containers and blobs. A container is a set of blobs; every blob must belong to a container. The Blob service defines two types of blobs:

    Block blobs, which are optimized for streaming. This type of blob is the only blob type available with versions prior to 2009-09-19.


    Page blobs, which are optimized for random read/write operations and which provide the ability to write to a range of bytes in a blob. Page blobs are available only with version 2009-09-19.


    Containers and blobs support user-defined metadata in the form of name-value pairs specified as headers on a request operation.


    A block blob may be created in one of two ways. Block blobs less than or equal to 64 MB in size can be uploaded by calling the Put Blob operation. Block blobs larger than 64 MB must be uploaded as a set of blocks, each of which must be less than or equal to 4 MB in size. A set of successfully uploaded blocks can be assembled in a specified order into a single contiguous blob by calling Put Block List. The maximum size currently supported for a block blob is 200 GB.

    Page blobs are created and initialized with a maximum size with a call to Put Blob. To write content to a page blob, you call the Put Page operation. The maximum size currently supported for a page blob is 1 TB.

    Blobs support conditional update operations that may be useful for concurrency control and efficient uploading.

    Blobs can be read by calling the Get Blob operation. A client may read the entire blob, or an arbitrary range of bytes.

    For the Blob service API reference, see Blob Service API.
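
    Continuing the sketch above (same container and blob variables, still illustrative rather than the talk's demo code), enumerating the virtual hierarchy and taking a lease look roughly like this:

    // Enumerate blobs under the virtual directory "MyGroup/".
    CloudBlobDirectory group = container.GetDirectoryReference("MyGroup");
    foreach (IListBlobItem item in group.ListBlobs(useFlatBlobListing: true))
    {
        Console.WriteLine(item.Uri);
    }

    // Lease Blob: take an exclusive write lock; subsequent writes must
    // present the lease ID or they are rejected.
    string leaseId = blob.AcquireLease(TimeSpan.FromSeconds(60), null);
    blob.UploadText("guarded write",
        accessCondition: AccessCondition.GenerateLeaseCondition(leaseId));
    blob.ReleaseLease(AccessCondition.GenerateLeaseCondition(leaseId));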
  • Locally Redundant Storage (LRS)
    Geographically Redundant Storage (GRS)
    Read-Access Geographically Redundant Storage (RA-GRS)
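
    With RA-GRS the secondary endpoint is readable, and the 2.x storage client can be pointed at it; a hedged sketch (the fallback policy shown is an assumption, and account is reused from the earlier sketch):

    using Microsoft.WindowsAzure.Storage.RetryPolicies;

    // Prefer the primary region, but fall back to the read-only
    // secondary for read requests if the primary is unreachable.
    CloudBlobClient raClient = account.CreateCloudBlobClient();
    raClient.DefaultRequestOptions.LocationMode = LocationMode.PrimaryThenSecondary;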
  • moshe@gmail.com, eli@gmail.com, me@gmail.com – were affected
    yossi@walla.co.il – was not affected

  • moshe@gmail.com, eli@gmail.com, me@gmail.com – were affected
    yossi@walla.co.il – was not affected

Transcript

  • 1. Tamir Dresher Senior Software Architect May 19, 2014 Where is my Data? (In the Cloud)
  • 2. About Me • Software architect, consultant and instructor • Software Engineering Lecturer @ Ruppin Academic Center • Technology addict • 10 years of experience • .NET and Native Windows Programming @tamir_dresher tamirdr@codevalue.net http://www.TamirDresher.com.
  • 3. Agenda • Storage • Blob • Azure SQL Server • Azure Tables • HDInsight
  • 4. Agenda • Storage • Blob • Azure SQL Server • Azure Tables • HDInsight
  • 5. Storage Where is my data Storage
  • 6. Storage Prices 6
  • 7. Types of information Where is my data Storage
  • 8. Windows Azure – Growing Global Presence: datacenters in North America, Europe, and Asia Pacific. Storage SLA – 99.99% (up to 52.56 minutes of downtime per year) http://azure.microsoft.com/en-us/support/legal/sla
  • 9. AZURE BLOBS 9
  • 10. What is a BLOB • BLOB – Binary Large OBject • Storage for any type of entity such as binary files and text documents • Distributed File Service (DFS) – scalability and high availability • A BLOB is distributed across multiple servers and replicated at least 3 times Where is my data BLOB
  • 11. Blob Storage Concepts 11 Where is my data BLOB
  • 12. Blob Operations REST Where is my data BLOB
  • 13. DEMO Creating a Blob 13
  • 14. BLOBS • Block blob - up to 200 GB in size • Page blobs – up to 1 TB in size • Total Account Capacity - 500 TB • Pricing – Storage capacity used – Replication option (LRS, GRS, RA-GRS) – Number of requests – Data egress – http://azure.microsoft.com/en-us/pricing/details/storage/ Where is my data BLOB
  • 15. SQL AZURE 15
  • 16. SQL Azure • SQL Server in the cloud • No administrative overheads • High Availability • pay-as-you-grow pricing • Familiar Development Model* * Despite missing features and some limitations - http://msdn.microsoft.com/en-us/library/ff394115.aspx Where is my data SQL Azure
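
    The "familiar development model" means plain ADO.NET works against SQL Azure; a minimal sketch (server, database, table, and credentials below are placeholders, not from the deck):

    using System.Data.SqlClient;

    string connectionString =
        "Server=tcp:myserver.database.windows.net,1433;" +
        "Database=mydb;User ID=myuser@myserver;" +
        "Password=...;Encrypt=True;Connection Timeout=30;";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand("SELECT COUNT(*) FROM Customers", connection))
    {
        connection.Open();
        // Regular T-SQL, same as against on-premises SQL Server.
        int customerCount = (int)command.ExecuteScalar();
    }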
  • 17. DEMO Creating and Using SQL Azure 17
  • 18. SQL Azure – Pricing Where is my data SQL Azure
  • 19. Case Study - https://haveibeenpwned.com/ Where is my data SQL Azure
  • 20. Case Study - https://haveibeenpwned.com/ • http://www.troyhunt.com/2013/12/working-with-154-million-records-on.html • How do I make querying 154 million email addresses as fast as possible? • If I want 100GB of SQL Server and I want to hit it 10 million times, it'll cost me $176 a month (now it's ~$20) Where is my data SQL Azure
  • 21. AZURE TABLES 21
  • 22. Table Storage Concepts 22 Where is my data Tables
  • 23. Table Storage • Not RDBMS – No relationships between entities – NoSQL • An entity can have up to 255 properties - up to 1MB per entity • Mandatory properties for every entity – PartitionKey & RowKey (the only indexed properties) • Uniquely identify an entity • The same RowKey can be used in different PartitionKeys • Define the sort order – Timestamp - Optimistic Concurrency Where is my data Tables
  • 24. No Fixed Schema 24 Where is my data Tables
  • 25. Table Object Model • ITableEntity interface – PartitionKey, RowKey, Timestamp, and ETag properties – Implemented by TableEntity and DynamicTableEntity
    // This class defines one additional property of integer type;
    // since it derives from TableEntity it will be automatically
    // serialized and deserialized.
    public class SampleEntity : TableEntity
    {
        public int SampleProperty { get; set; }
    }
    Where is my data Tables
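
    When the schema is not known at compile time, DynamicTableEntity acts as a property bag; a hedged sketch (the storageAccount variable and the "people" table mirror the samples on the next slides):

    // A property-bag entity: no fixed schema, properties attached at runtime.
    CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
    CloudTable peopleTable = tableClient.GetTableReference("people");

    var dynamicCustomer = new DynamicTableEntity("Smith", "Ben");
    dynamicCustomer.Properties["Email"] = new EntityProperty("Ben@contoso.com");
    dynamicCustomer.Properties["Age"] = new EntityProperty(42);

    peopleTable.Execute(TableOperation.Insert(dynamicCustomer));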
  • 26. Sample – Inserting an Entity into a Table
    // You will need the following using statements
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Create the table client.
    CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
    CloudTable peopleTable = tableClient.GetTableReference("people");
    peopleTable.CreateIfNotExists();

    // Create a new customer entity.
    CustomerEntity customer1 = new CustomerEntity("Harp", "Walter");
    customer1.Email = "Walter@contoso.com";
    customer1.PhoneNumber = "425-555-0101";

    // Create an operation to add the new customer to the people table.
    TableOperation insertCustomer1 = TableOperation.Insert(customer1);

    // Submit the operation to the table service.
    peopleTable.Execute(insertCustomer1);
    Where is my data Tables
  • 27. Retrieve
    // Create the table client.
    CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
    CloudTable peopleTable = tableClient.GetTableReference("people");

    // Retrieve the entity with partition key of "Smith" and row key of "Jeff"
    TableOperation retrieveJeffSmith = TableOperation.Retrieve<CustomerEntity>("Smith", "Jeff");

    // Retrieve entity
    CustomerEntity specificEntity = (CustomerEntity)peopleTable.Execute(retrieveJeffSmith).Result;
    Where is my data Tables
  • 28. Table Storage – Important Points • Azure Tables can store TBs of data • Tables Operations are fast • Tables are distributed –PartitionKey defines the partition – A table might be stored in different partitions on different storage devices. Where is my data Tables
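
    Because PartitionKey is indexed and routes to a single partition server, a partition-scoped query stays fast even on huge tables; a sketch reusing peopleTable and CustomerEntity from the earlier slides:

    // Retrieve every customer in the "Smith" partition.
    TableQuery<CustomerEntity> partitionQuery = new TableQuery<CustomerEntity>().Where(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "Smith"));

    foreach (CustomerEntity customer in peopleTable.ExecuteQuery(partitionQuery))
    {
        Console.WriteLine("{0} {1}: {2}", customer.PartitionKey, customer.RowKey, customer.Email);
    }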
  • 29. Pricing Where is my data Tables
  • 30. Case Study - https://haveibeenpwned.com/ Where is my data Tables
  • 31. Case Study - https://haveibeenpwned.com/ • How do I make querying 154 million email addresses as fast as possible? • foo@bar.com – the domain is the partition key and the alias is the row key • If I want 100GB of storage and I want to hit it 10 million times, it'll cost me $8 a month • SQL Server will cost $176 a month - 22 times more expensive Where is my data Tables
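
    A hypothetical sketch of that partitioning scheme (the helper and table names are illustrative, not Troy Hunt's actual code): splitting the address makes every lookup a single-entity point query.

    // foo@bar.com -> PartitionKey = "bar.com", RowKey = "foo"
    static TableOperation LookupByEmail(string email)
    {
        string[] parts = email.Split('@');
        string alias = parts[0];
        string domain = parts[1];
        // A point query on (PartitionKey, RowKey) – the cheapest and
        // fastest operation Azure Tables offers.
        return TableOperation.Retrieve<DynamicTableEntity>(domain, alias);
    }

    // Usage: var result = accountsTable.Execute(LookupByEmail("foo@bar.com"));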
  • 32. HDINSIGHT 32
  • 33. Hadoop in the cloud • Hadoop on Azure Cloud • Some Facts: – Bing ingests > 7 petabytes a month – The Twitter community generates over 1 terabyte of tweets every day – Cisco predicted that annual internet traffic would reach 667 exabytes by 2013 Where is my data HDInsight Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
  • 34. MapReduce – The BigData Power • Map – takes input and outputs key-value pairs: (Key1, Value1), (Key2, Value2), …, (Keyn, Valuen) Where is my data HDInsight
  • 35. MapReduce – The BigData Power • Reduce – takes the group of values collected per key and produces a new group of values: Key1: [value1-1, value1-2, …] → [new_value1-1, new_value1-2, …]; Key2: [value2-1, value2-2, …] → [new_value2-1, new_value2-2, …]; …; Keyn: [valueN-1, valueN-2, …] → [new_valueN-1, new_valueN-2, …] Where is my data HDInsight
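
    Before the common-friends walkthrough, here is how the two halves fit together on a canonical word count, sketched (as an assumption, not a slide from the deck) with the same MapperBase/ReducerCombinerBase classes from the Microsoft .NET SDK for Hadoop that the C# slides use later:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.Hadoop.MapReduce;

    public class WordCountMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            // Map: emit a (word, "1") pair for every word in the line.
            foreach (var word in inputLine.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                context.EmitKeyValue(word.ToLowerInvariant(), "1");
        }
    }

    public class WordCountReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values,
                                    ReducerCombinerContext context)
        {
            // Reduce: the framework has grouped the values by key; sum the occurrences.
            context.EmitKeyValue(key, values.Count().ToString());
        }
    }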
  • 36. MapReduce - How Does It Work? Where is my data HDInsight
  • 37. So How Does It Work? Where is my data HDInsight
  • 38. Finding common friends • Facebook shows you how many common friends you have with someone • There were 1,310,000,000 active users on Facebook, with 130 friends on average (01.01.2014) • Calculating the mutual friends Where is my data HDInsight
  • 39. Finding common friends • We can represent a friend relationship as: Someone → [list of his/her friends] • Note that a friend relationship is symmetrical – if A is a friend of B then B is a friend of A Where is my data HDInsight Common Friends
  • 40. Example of Friends file • U1 -> U2 U3 U4 • U2 -> U1 U3 U4 U5 • U3 -> U1 U2 U4 U5 • U4 -> U1 U2 U3 U5 • U5 -> U2 U3 U4 Where is my data HDInsight Common Friends
  • 41. Designing our MapReduce job • Each line from the file will be an input line to the Mapper • The Mapper will output key-value pairs • Key: (user, friend) – sorted, so the friend might come before the user • Value: list of friends Where is my data HDInsight Common Friends
  • 42. Designing our MapReduce job - Mapper • Each line from the file will be an input line to the Mapper • The Mapper will output key-value pairs • Key: (user, friend) – sorted, so the friend might come before the user • Value: list of friends • Having the key sorted will help us with the reducer: identical pairs will be provided together Where is my data HDInsight Common Friends
  • 43. Mapper Example Where is my data HDInsight Common Friends
    Given the line: U1 → U2 U3 U4, the Mapper output is:
    (U1 U2) → U2 U3 U4; (U1 U3) → U2 U3 U4; (U1 U4) → U2 U3 U4
  • 44. Mapper Example Where is my data HDInsight Common Friends
    Given the line: U1 → U2 U3 U4, the Mapper output is:
    (U1 U2) → U2 U3 U4; (U1 U3) → U2 U3 U4; (U1 U4) → U2 U3 U4
    Given the line: U2 → U1 U3 U4 U5, the Mapper output is:
    (U1 U2) → U1 U3 U4 U5; (U2 U3) → U1 U3 U4 U5; (U2 U4) → U1 U3 U4 U5; (U2 U5) → U1 U3 U4 U5
  • 45. Mapper Example – final result Where is my data HDInsight Common Friends
    Given the line: U1 → U2 U3 U4, the Mapper output is:
    (U1 U2) → U2 U3 U4; (U1 U3) → U2 U3 U4; (U1 U4) → U2 U3 U4
    Given the line: U2 → U1 U3 U4 U5, the Mapper output is:
    (U1 U2) → U1 U3 U4 U5; (U2 U3) → U1 U3 U4 U5; (U2 U4) → U1 U3 U4 U5; (U2 U5) → U1 U3 U4 U5
    Given the line: U3 → U1 U2 U4 U5, the Mapper output is:
    (U1 U3) → U1 U2 U4 U5; (U2 U3) → U1 U2 U4 U5; (U3 U4) → U1 U2 U4 U5; (U3 U5) → U1 U2 U4 U5
    Given the line: U4 → U1 U2 U3 U5, the Mapper output is:
    (U1 U4) → U1 U2 U3 U5; (U2 U4) → U1 U2 U3 U5; (U3 U4) → U1 U2 U3 U5; (U4 U5) → U1 U2 U3 U5
    Given the line: U5 → U2 U3 U4, the Mapper output is:
    (U2 U5) → U2 U3 U4; (U3 U5) → U2 U3 U4; (U4 U5) → U2 U3 U4
  • 46. Designing our MapReduce job - Reducer • The input for the reducer will be structured as: (friend1, friend2) → (friend1's friends) (friend2's friends) • The reducer will find the intersection between the lists • Output: (friend1, friend2) → (intersection of friend1's and friend2's friends) Where is my data HDInsight Common Friends
  • 47. Reducer Example Where is my data HDInsight Common Friends
    Given the line: (U1 U2) → (U1 U3 U4 U5) (U2 U3 U4), the Reducer output is: (U1 U2) → (U3 U4)
    Given the line: (U1 U3) → (U1 U2 U4 U5) (U2 U3 U4), the Reducer output is: (U1 U3) → (U2 U4)
    Given the line: (U1 U4) → (U1 U2 U3 U5) (U2 U3 U4), the Reducer output is: (U1 U4) → (U2 U3)
    Given the line: (U2 U3) → (U1 U2 U4 U5) (U1 U3 U4 U5), the Reducer output is: (U2 U3) → (U1 U4 U5)
    Given the line: (U2 U4) → (U1 U2 U3 U5) (U1 U3 U4 U5), the Reducer output is: (U2 U4) → (U1 U3 U5)
    Given the line: (U2 U5) → (U1 U3 U4 U5) (U2 U3 U4), the Reducer output is: (U2 U5) → (U3 U4)
    Given the line: (U3 U4) → (U1 U2 U3 U5) (U1 U2 U4 U5), the Reducer output is: (U3 U4) → (U1 U2 U5)
    Given the line: (U3 U5) → (U1 U2 U4 U5) (U2 U3 U4), the Reducer output is: (U3 U5) → (U2 U4)
    Given the line: (U4 U5) → (U1 U2 U3 U5) (U2 U3 U4), the Reducer output is: (U4 U5) → (U2 U3)
  • 48. Creating C# MapReduce Where is my data HDInsight Common Friends
  • 49. Creating C# MapReduce - Mapper Where is my data HDInsight Common Friends
    public class CommonFriendsMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            var strings = inputLine.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (strings.Any())
            {
                var currentUser = strings[0];
                var friends = strings.Skip(1);
                foreach (var friend in friends)
                {
                    // Sort the (user, friend) pair so that both sides of a
                    // friendship emit the same key.
                    var keyArr = new[] { currentUser, friend };
                    Array.Sort(keyArr);
                    var key = String.Join(" ", keyArr);
                    context.EmitKeyValue(key, string.Join(" ", friends));
                }
            }
        }
    }
  • 50. Creating C# MapReduce - Reducer Where is my data HDInsight Common Friends
    public class CommonFriendsReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> strings, ReducerCombinerContext context)
        {
            // Each (user, friend) key arrives with exactly two friend lists;
            // their intersection is the pair's set of common friends.
            var friendsLists = strings
                .Select(friendList => friendList.Split(' '))
                .ToList();
            var intersection = friendsLists[0].Intersect(friendsLists[1]);
            context.EmitKeyValue(key, string.Join(" ", intersection));
        }
    }
  • 51. Creating C# MapReduce – Hadoop Job Where is my data HDInsight Common Friends
    HadoopJobConfiguration myConfig = new HadoopJobConfiguration();
    myConfig.InputPath = "wasb:///example/data/friends/friends";
    myConfig.OutputFolder = "wasb:///example/data/friends/output";

    Environment.SetEnvironmentVariable("HADOOP_HOME", @"c:\hadoop");
    Environment.SetEnvironmentVariable("Java_HOME", @"c:\hadoop\jvm");

    var hadoop = Hadoop.Connect(clusterUri, clusterUserName, hadoopUserName,
                                clusterPassword, azureStorageAccount, azureStorageKey,
                                azureStorageContainer, createContainerIfNotExist);

    var jobResult = hadoop.MapReduceJob.Execute<CommonFriendsMapper, CommonFriendsReducer>(myConfig);

    int exitCode = jobResult.Info.ExitCode; // (0 – success, otherwise – failure)
  • 52. Pricing Where is my data HDInsight 10 node cluster that will exist for 24 hours: • Secure Gateway node - free • Head node - 15.36 USD per 24-hour day • 1 data node - 7.68 USD per 24-hour day • 10 data nodes - 76.80 USD per 24-hour day • Total: 92.16 USD
  • 53. WRAP UP 53
  • 54. Comparing the alternatives Where is my data Wrap Up
    Storage Type | When Should You Use It | Implications
    BLOB | Unstructured data; files | Application logic responsibility; consider using HDInsight (Hadoop)
    SQL Server | Structured relational data; ACID transactions; max 150GB (500GB in preview) | SQL DML+DDL; could affect scalability; BI abilities; reporting
    Azure Tables | Structured data; loose schema; geo replication (high DR); auto sharding | OData, REST; application logic responsibility (multiple schemas)
  • 55. What have we seen • Azure Blobs • Azure Tables • Azure SQL Server • HDInsight Where is my data Wrap Up
  • 56. What’s Next • NoSQL – MongoDB, Cassandra, CouchDB, RavenDB • Hadoop ecosystem – Hive, Pig, Sqoop, Mahout • http://blogs.msdn.com/b/windowsazure/ • http://blogs.msdn.com/b/windowsazurestorage/ • http://blogs.msdn.com/b/bigdatasupport/ Where is my data Wrap Up
  • 57. Presenter contact details c: +972-52-4772946 t: @tamir_dresher e: tamirdr@codevalue.net b: TamirDresher.com w: www.codevalue.net