NoSQL for the SQL Server Pro

Published in: Technology

  • From the O’Reilly / Strata “Getting Ready for Big Data” report: “the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data”
  • http://www.inboundlogistics.com/cms/article/m2m-101/
    http://www.freebase.com/
    Hilary Mason’s datasets - https://bitly.com/bundles/hmason/1
  • Facial recognition database - http://www.face-rec.org/databases/
    http://usahitman.com/unexpected-fr/
  • https://www.eff.org/deeplinks/2012/09/indias-gargantuan-biometric-database-raises-big-questions
  • http://hint.fm/wind/
  • http://d3js.org/
  • http://hadoop.apache.org/
    http://en.wikipedia.org/wiki/Apache_Hadoop
  • http://www.oracle.com/technetwork/bdc/hadoop-loader/overview/index.html
    http://www.microsoft.com/download/en/details.aspx?id=27584
  • http://hortonworks.com/technology/hortonworksdataplatform/
    More about HBase, from the O’Reilly “Getting Ready for Big Data” report:
    “Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google’s BigTable, the project’s goal is to host billions of rows of data for rapid access. MapReduce can use HBase as both a source and a destination for its computations, and Hive and Pig can be used in combination with HBase.
    In order to grant random access to the data, HBase does impose a few restrictions: performance with Hive is 4-5 times slower than plain HDFS, and the maximum amount of data you can store is approximately a petabyte, versus HDFS’ limit of over 30PB.”
    http://www.cloudera.com/
  • https://www.hadooponazure.com/Account
    Demo - http://www.youtube.com/watch?v=XcHz8aUDDN8 and http://www.youtube.com/watch?v=c7oHntP8HBI
  • Original Reference: Tom White’s Hadoop: The Definitive Guide (I made some modifications based on my experience)
  • http://lynnlangit.wordpress.com/2011/11/09/relational-cloud-storage-is-50x-more-expensive-than-nosql/
  • http://nosql-database.org/
    http://hadoop.apache.org/ & http://www.mongodb.org/
    Wikipedia - http://en.wikipedia.org/wiki/NoSQL
    List of NoSQL databases – http://nosql-database.org/
    The good, the bad - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
  • http://bigdatanerd.wordpress.com/2012/01/04/why-nosql-part-2-overview-of-data-modelrelational-nosql/
    http://docs.jboss.org/hibernate/ogm/3.0/reference/en-US/html_single/
  • http://en.wikipedia.org/wiki/Project_Voldemort
    http://aws.amazon.com/
    http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/Introduction.html
    http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
  • http://code.google.com
    Access via REST APIs
    Very cheap, but not much functionality included
    Lots of code to write for application development
    But… can be a good backup solution
  • http://stage.hypertable.com/index.php/documentation/architecture/
    http://code.google.com/appengine/
    http://code.google.com/appengine/articles/datastore/overview.html
  • http://cwebbbi.wordpress.com/2012/02/14/so-what-is-the-bi-semantic-model/
    http://www.databasejournal.com/features/mssql/understanding-new-column-store-index-of-sql-server-2012.html
    http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html
    http://ayende.com/blog/4500/that-no-sql-thing-column-family-databases
  • http://en.wikipedia.org/wiki/MongoDB
    http://www.mongodb.org/downloads
    http://www.mongodb.org/display/DOCS/Drivers
  • http://www.infinitegraph.com/what-is-a-graph-database.html
    http://en.wikipedia.org/wiki/Graph_database
    http://www.freebase.com/
  • http://berb.github.com/diploma-thesis/original/061_challenge.html
    http://nosqltips.blogspot.com/2011/04/cap-diagram-for-distribution.html
    http://blog.mccrory.me/2010/11/03/cap-theorem-and-the-clouds/
    http://amitpiplani.blogspot.com/2010/05/u-pick-2-selection-for-nosql-providers.html

    ACID VS BASE

    ACID
    RDBMS are the predominant database systems currently in use for most applications, including web applications. Their data model and their internals are strongly connected to transactional behaviour when operating on data. However, transactional behaviour is not solely related to RDBMS, but is also used for other systems. A set of properties describes the guarantees that database transactions generally provide in order to maintain the validity of data [Hae83].

    Atomicity: This property determines that a transaction executes in an "all or nothing" manner. A transaction can either be a single operation, or a sequence of operations or sub-transactions. As a result, a transaction either succeeds and the database state is changed, or the database remains unchanged and all operations are dismissed.

    Consistency: In the context of transactions, we define consistency as the transition from one valid state to another, never leaving behind an invalid state when executing transactions.

    Isolation: The concept of isolation guarantees that no transaction sees premature operations of other running transactions. In essence, this prevents conflicts between concurrent transactions.

    Durability: The durability property assures persistence of executed transactions. Once a transaction has committed, the outcome of the transaction, such as state changes, is kept, even in case of a crash or other failures.

    Strongly adhering to the principles of ACID results in an execution order that has the same effect as a purely serial execution. In other words, there is always a serially equivalent order of transactions that represents the exact same state [Dol05].

    Ensuring a serializable order negatively affects performance and concurrency, even when a single machine is used. In fact, some of the properties are often relaxed to a certain extent in order to improve performance. A weaker isolation level between transactions is the most used mechanism to speed up transactions and their throughput. Stepwise, a transactional system can leave serializability and fall back to the weaker isolation levels repeatable read, read committed and read uncommitted. These levels gradually remove range locks, read locks and write locks, respectively. As a result, concurrent transactions are less isolated and can see partial results of other transactions, yielding so-called read phenomena. Some implementations also weaken the durability property by not guaranteeing to write directly to disk. Instead, committed states are buffered in memory and eventually flushed to disk. This heavily decreases latencies at the cost of data integrity.

    Consistency is still a core property of the ACID model that cannot be relaxed easily. The mutual dependencies of the properties make it impossible to remove a single property without affecting the others. Referring back to the CAP theorem, we have seen the trade-off between consistency and availability regarding distributed database systems that must tolerate partitions. In case we choose a database system that follows the ACID paradigm, we cannot guarantee high availability anymore. The usage of ACID as part of a distributed system yields the need for distributed transactions or similar mechanisms for preserving the transactional properties when state is shared and sharded onto multiple nodes.

    Now let us reconsider what would happen if we evict the burden of distributed transactions. As we are talking about distributed systems, we have no global shared state by default. The only knowledge we have is a per-node knowledge of its own past. Having no global time, no global now, we cannot inherently have atomic operations on system level, as operations occur at different times on different machines. This softens isolation and we must leave the notion of global state for now. Having no immediate, global state of the system in turn endangers durability.

    In conclusion, building distributed systems adhering to the ACID paradigm is a demanding challenge. It requires complex coordination and synchronization efforts between all involved nodes, and generates considerable communication overhead within the entire system. It is not for nothing that distributed transactions are feared by many architects [Hel09,Alv11]. Although it is possible to build such systems, some of the original motivations for using a distributed database system have been mitigated on this path. Isolation and serializability contradict scalability and concurrency. Therefore, we will now consider an alternative model that sacrifices consistency for other properties that are interesting for certain systems.

    BASE
    This alternative consistency model is basically the subsumption of properties resulting from a system that provides availability and partition tolerance, but no strong consistency [Pri08]. While a strong consistency model as provided by ACID implies that all subsequent reads after a write yield the new, updated state for all clients and on all nodes of the system, this is weakened for BASE. Instead, the weak consistency of BASE comes with an inconsistency window, a period in which the propagation of the update is not yet guaranteed.

    Basically available: The availability of the system even in case of failures represents a strong feature of the BASE model.

    Soft state: No strong consistency is provided and clients must accept stale state under certain circumstances.

    Eventually consistent: Consistency is provided in a "best effort" manner, yielding a consistent state as soon as possible.

    The optimistic behaviour of BASE represents a best effort approach towards consistency, but is also said to be simpler and faster. Availability and scaling capacities are primary objectives at the expense of consistency. This has an impact on the application logic, as the developer must be aware of the possibility of stale data. On the other hand, favoring availability over consistency also has benefits for the application logic in some scenarios. For instance, a partial system failure of an ACID system might reject write operations, forcing the application to handle the data to be written somehow. A BASE system in turn might always accept writes, even in case of network failures, but they might not be visible to other nodes immediately. The applicability of relaxed consistency models depends very much on the application requirements. Strict constraints of balanced accounts for a banking application do not fit eventual consistency naturally. But many web applications built around social objects and user interactions can actually accept slightly stale data. When the inconsistency window is on average smaller than the time between request/response cycles of user interactions, a user might not even realize any kind of inconsistency at all.

    http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
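The inconsistency window described in the BASE notes can be illustrated with a small sketch. This is a toy model in plain Python, not how any particular NoSQL product replicates data; the class names `Replica` and `EventuallyConsistentStore` and the two-node layout are assumptions made only for illustration.

```python
from collections import deque


class Replica:
    """One node holding its own copy of the data (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class EventuallyConsistentStore:
    """Toy two-replica store with asynchronous replication (hypothetical)."""

    def __init__(self):
        self.primary = Replica("primary")
        self.secondary = Replica("secondary")
        self.pending = deque()  # updates accepted but not yet propagated

    def write(self, key, value):
        # The write is accepted immediately ("basically available") ...
        self.primary.data[key] = value
        # ... and queued for later propagation ("soft state").
        self.pending.append((key, value))

    def read(self, key, replica):
        return replica.data.get(key)

    def replicate(self):
        # Drain the queue; afterwards both replicas agree
        # ("eventually consistent").
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary.data[key] = value


store = EventuallyConsistentStore()
store.write("status", "online")
stale = store.read("status", store.secondary)  # inside the inconsistency window
store.replicate()
fresh = store.read("status", store.secondary)  # after propagation
print(stale, fresh)
```

Between `write` and `replicate`, a client reading from the secondary replica sees stale (here: missing) data, which is the situation the application developer has to plan for under BASE.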
  • For Google - http://code.google.com
    For AWS - https://console.aws.amazon.com/console/home
  • Hadoop on AWS - http://wiki.apache.org/hadoop/AmazonEC2
  • http://rickosborne.org/download/SQL-to-MongoDB.pdf
  • About Data Science -- http://www.romymisra.com/the-new-job-market-rulers-data-scientists/
    R language - http://www.r-project.org/
    Infer.NET - http://research.microsoft.com/en-us/um/cambridge/projects/infernet/
    Julia language -- http://julialang.org/
    There is a plethora of languages to access, manipulate and process BigData. These languages fall into a few categories:
      RESTful – simple, standards
      ETL – Pig (Hadoop) is an example
      Query – Hive (again Hadoop), lots of *QL
      Analyze – R, Mahout, Infer.NET, DMX, etc. Applying statistical (data-mining) algorithms to the data output
  • http://www.youtube.com/watch?v=gjsMDAcI1Mo - analyst
    http://www.youtube.com/watch?v=_MT04szKlyo
    http://aws.amazon.com/articles/9574327584309154
  • http://www.microsoft.com/en-us/bi/default.aspx
    http://dennyglee.com/
    Demos - http://www.youtube.com/watch?v=djfpPsGwm6A and http://www.youtube.com/watch?v=uh9bKWO1K7U
  • http://research.google.com/pubs/pub36632.html
  • http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
  • DataMarkets – InfoChimps, Factual, DataMarket, Windows Azure Data Marketplace, Wolfram Alpha, Datasift
    http://www.microsoft.com/en-us/sqlazurelabs/default.aspx and http://www.microsoft.com/en-us/sqlazurelabs/labs/dataexplorer.aspx
    https://datamarket.azure.com/
    http://www.freebase.com/
    http://code.google.com/p/google-refine/
  • Lynn
  • Transcript

    • 1. NoSQL for the DBA Lynn Langit April 2013 – Big Data Tech Con
    • 2. Data Expertise / Lynn Langit
        • Industry awards
            – Microsoft – MVP for SQL Server
            – Google – GDE for Cloud Platform
            – 10Gen – Master for MongoDB
        • Practicing Architect
        • Technical author / trainer
            – Pluralsight – Google Cloud Series
            – DevelopMentor – SQL Server Series
            – 2 books on SQL Server BI
            – Cloudera trainer (certified)
        • Former MSFT FTE – 4 years
    • 3. but first… Business Intelligence to BigData
    • 4. What is the relationship? Business Intelligence ???? NoSQL
    • 5. “The Past” BI = Effective Reports: data optimized for static READING
    • 6. BI = Optimized RDBMS: SQL queries & data stored on disk
    • 7. BI = OLAP Cubes storage
    • 8. BI = OLAP Cubes clients
    • 9. BI = Transactional Data
        Collecting transactional data
        • What happened?
        • Why did that happen?
        • Decision Support Systems
    • 10. So Why Change?
    • 11. Enter Big Data
        Q: What is it?
        A: Your data, plus more data…
    • 12. BigData Pipeline – STEP 1 – Acquire
        Pipeline: Acquire, Process, Store, Query & Mine, Visualize
    • 13. Big Data – an example from weather
    • 14. Big Data – an example from weather
        • Source Data
            • National weather data
            • Satellite data
            • Airplanes with sensors
            • Sensors on boats
            • Sensors in the ocean
            • Sensors on the ground
            • Historical Data
            • Social Media
        • Results
            • More accurate predictions
            • Tsunami
            • Tornado
    • 15. Big Data – an example from health care
        • Medical records
            • Regular
            • Emergency
            • Genetic data – 23andMe
        • Food data
            • SparkPeople
        • Purchasing
            • Grocery card
            • Credit card
        • Search – Google
        • Social media
            • Twitter
            • Facebook
        • Exercise
            • Nike Fuel Band
            • Kinect
            • Location - phone
    • 16. BigData = ‘Next State’ Questions
        Collecting behavioral data
        • What could happen?
        • Why didn’t this happen?
        • When will the next new thing happen?
        • What will the next new thing be?
        • What happens?
    • 17. What is the reality of personalized medicine?
        [Chart: Key Monitoring Sensor Readings and Other Behavioral data, plotted from 12:00 to 2:30]
    • 18. BigData and Verticals
        • Retail
        • Manufacturing
        • Health Care
        • Banking
        • Education
    • 19. Collecting BigData
        • Sensors everywhere
        • Structured, Semi-structured, Unstructured vs. Data Standards
        • M2M
        • Public Datasets
            – Freebase
            – Azure DataMarket
            – Hilary Mason’s list
    • 20. DEMO – Hilary Mason’s Datasets
        • Who is Hilary Mason and why do you care about her datasets?
        • How do you get her datasets?
        • What do you do with her datasets?
    • 21. Collecting Data – a note about Faces
        • Facial recognition
        • Voice recognition
        • Gesture capture and analysis
    • 22. Petabytes of Big Data
    • 23. Big Data at Apple
    • 24. Big Data in India
        Update: “The total number of AADHAARs issued as of 24-Mar-2013 is over 304 million. This is more than 25% of the population of India.”
    • 25. BigData Pipeline – STEP 5 – Visualize
        Pipeline: Acquire, Process, Store, Query & Mine, Visualize
    • 26. DEMO - Visualizing Big Data: Wind Map
    • 27. DEMO - Visualizing Big Data – D3
    • 28. BigData Pipeline – STEP 2 – Process
        Pipeline: Acquire, Process, Store, Query & Mine, Visualize
    • 29. How do you clean up the mess?
        • Data Hygiene
        • Data Scrubbing
        • Data Sprawl
        • The true cost of data
        • …and what about data integrity?
        • …and security?
        • …should your data be in the cloud?
    • 30. Is NoSQL just Hadoop?
        HUGE hype factor since 2011
        Apache Hadoop
        • a software framework that supports data-intensive distributed applications under a free license
        • enables applications to work with thousands of nodes and petabytes of data
        • was inspired by Google’s MapReduce and Google File System (GFS) papers
    • 31. What is the relationship? NoSQL ??? Hadoop ??? BigData
    • 32. Hadoop in the Enterprise
    • 33. How you ‘get’ Hadoop
        Open source
        • roll your own
        Commercial distribution
        • Cloudera
        • MapR
        • Hortonworks
        • More…
        Rent it via the cloud
        • AWS
    • 34. Demo – Get and Use Cloudera CDH4 VM
    • 35. Working with Hadoop
    • 36. About Hadoop MapReduce Image from - https://developers.google.com/appengine/docs/python/images/mapreduce_mapshuffle.png
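The map, shuffle and reduce stages that the slide's image depicts can be sketched in a few lines of plain Python. This is only a single-process illustration of the programming model, with hypothetical function names; it does not use Hadoop or any of its APIs, and a real job would distribute each phase across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit one (word, 1) pair per word in the input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big hype", "data pipeline"])
print(counts)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the work can in principle be spread over many machines, which is where the batch-oriented scaling discussed in the deck comes from.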
    • 37. Demo - HDInsight – MapReduce w/Java
    • 38. Demo - HDInsight – MapReduce w/ Hive
    • 39. Example Comparison: RDBMS vs. Hadoop
                                 Traditional RDBMS         Hadoop / MapReduce
        Data Size                Gigabytes (Terabytes)     Petabytes and greater
        Access                   Interactive and Batch     Batch – NOT Interactive
        Updates                  Read / Write many times   Write once, Read many times
        Structure                Static Schema             Dynamic Schema
        Integrity                High (ACID)               Low
        Scaling                  Nonlinear                 Linear
        Query Response Time      Can be near immediate     Has latency (due to batch processing)
    • 40. BigData Pipeline – STEP 3 – Store
        Pipeline: Acquire, Process, Store, Query & Mine, Visualize
    • 41. “Small” BigData vs. “Big” BigData (diagram labels: Hadoop, NoSQL, RDBMS)
    • 42. The reality… two pivots
        Storage Methods
        • SQL (RDBMS)
        • NoSQL or Hadoop
        Storage Locations
        • On premises
        • Cloud-hosted
    • 43. Cloud-hosted NoSQL up to 50x CHEAPER
    • 44. So many NoSQL options
        • More than just the Elephant in the room
        • More than 120 types of NoSQL databases
    • 45. Flavors of NoSQL
        Key/Value (Volatile), Key/Value (Persistent), Wide-Column, Document, Graph
    • 46. Key / Value Database
        • Just keys and values
            – No schema
        • Persistent or Volatile
        • Examples
            – AWS DynamoDB
            – Riak
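To make the "just keys and values, persistent or volatile" idea concrete, here is a minimal sketch in plain Python. The `KeyValueStore` class and its JSON-file persistence are assumptions for illustration only; they are not the API of DynamoDB, Riak or any other product.

```python
import json
import os
import tempfile


class KeyValueStore:
    """Hypothetical key/value store: values are looked up by key, no schema."""

    def __init__(self, path=None):
        self.path = path  # None means volatile (memory only)
        self.data = {}
        if path and os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def put(self, key, value):
        self.data[key] = value
        if self.path:  # persistent mode: write through to disk
            with open(self.path, "w") as f:
                json.dump(self.data, f)

    def get(self, key, default=None):
        return self.data.get(key, default)


# Persistent: a second store opened on the same file sees the value.
path = os.path.join(tempfile.mkdtemp(), "kv.json")
KeyValueStore(path).put("session:42", {"user": "lynn"})
reloaded = KeyValueStore(path).get("session:42")

# Volatile: a new in-memory store starts empty.
volatile = KeyValueStore()
volatile.put("session:42", {"user": "lynn"})
lost = KeyValueStore().get("session:42")
print(reloaded, lost)
```

The store never inspects the value, which is what "no schema" means here: any structure inside the value is the application's concern.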
    • 47. DEMO - AWS DynamoDB
        • Key/Value store on the AWS cloud
    • 48. NoSQL BLOB Storage Buckets in the Cloud
        • Amazon – S3 or Glacier
        • Google – Cloud Storage
        • Microsoft Azure BLOBS
        • Others
            – Dropbox
            – Box
            – More…
    • 49. DEMO - Battle of the Buckets
        • Google Cloud Storage VS.
        • Windows Azure BLOBS VS.
        • AWS S3 / Glacier
    • 50. Column Database
        • Wide, sparse column sets
        • Schema-light
        • Examples:
            – Cassandra
            – HBase w/Hadoop
            – BigTable
            – GAE HR DS
    • 51. Types of Column Databases
        • Column-families
            – Non-relational
            – Sparse
            – Examples:
                • HBase
                • Cassandra
                • xVelocity (SQL 2012 Tabular)
        • Column-stores
            – Relational
            – Dense
            – Example:
                • SQL Server 2012 – Columnstore index
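The sparse column-family versus dense column-store distinction on this slide can be sketched with ordinary Python dictionaries and lists. The data and variable names below are made up for illustration; real engines such as HBase or the SQL Server columnstore index use very different physical formats.

```python
# Row-oriented layout: each record is stored together (typical RDBMS table).
rows = [
    {"id": 1, "region": "West", "sales": 100},
    {"id": 2, "region": "East", "sales": 250},
    {"id": 3, "region": "West", "sales": 175},
]

# Column-store layout (relational, dense): one list per column,
# every column has a value for every row.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column only needs to scan that column.
total_from_rows = sum(row["sales"] for row in rows)
total_from_columns = sum(columns["sales"])

# Column-family layout (non-relational, sparse): each row key carries
# only the columns it actually has.
column_family = {
    "user:1": {"name": "Ann", "email": "ann@example.com"},
    "user:2": {"name": "Bob", "twitter": "@bob"},
}
has_email = [key for key, cols in column_family.items() if "email" in cols]

print(total_from_rows, total_from_columns, has_email)
```

The column-store version answers the aggregate by reading a single list, while the column-family version lets different rows carry different columns without storing empty cells.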
    • 52. DEMO – SQL Server ‘NoSQL’
        • SQL Server 2012 Columnstore Index
        • SQL Server 2012 Tabular Model (SSAS)
    • 53. Document Database (MongoDB)
        • document-oriented (collection of JSON documents) w/semi-structured data
            – Encodings include BSON, JSON, XML…
        • binary forms (PDF, Microsoft Office documents: Word, Excel…)
        • Examples:
            – MongoDB
            – Couchbase
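A collection of semi-structured JSON documents can be mimicked with a list of Python dictionaries. The sketch below is plain Python, not the MongoDB driver; the `find` helper is a hypothetical stand-in for a query-by-example call and only matches top-level fields.

```python
import json

# A "collection": documents need not share the same fields.
collection = [
    {"_id": 1, "name": "Ann", "tags": ["sql", "bi"]},
    {"_id": 2, "name": "Bob", "address": {"city": "Irvine"}},
    {"_id": 3, "name": "Cat", "tags": ["nosql"]},
]


def find(docs, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


matches = find(collection, name="Bob")
tagged = [doc["name"] for doc in collection if "tags" in doc]

# Documents serialize naturally to JSON text.
print(json.dumps(matches))
print(tagged)
```

Note that document 2 has an `address` and no `tags`, while the others are the reverse; nothing in the collection had to be altered to allow that, which is the practical meaning of "semi-structured".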
    • 54. Demo - Mongo DB
    • 55. Graph Databases
        • a lot of many-to-many relationships
        • recursive self-joins
        • when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
        • Examples:
            – Neo4J
            – Google Freebase
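Finding connections, the primary use case named on this slide, amounts to a graph traversal. The sketch below uses a breadth-first search over a small made-up "follows" graph in plain Python; it is not Neo4J's API or query language, just an illustration of the kind of question that would otherwise need recursive self-joins in SQL.

```python
from collections import deque

# Hypothetical social graph: who follows whom (adjacency list).
follows = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan"],
    "dan": ["eve"],
    "eve": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: returns one shortest path from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = shortest_path(follows, "ann", "eve")
print(path)
```

In a relational schema, each hop in this path would be another self-join on a relationship table; a graph database stores the edges directly so that traversals like this are the native operation.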
    • 56. DEMO – Neo4J
    • 57. CAP Theorem applied = ‘how big is it?’
        • CA = RDBMS
            – Highly-available consistency
            – Ex. SQL Server
        • CP = NoSQL
            – Enforced consistency
            – Ex. Hadoop
        • AP = NoSQL
            – Eventual consistency
            – Ex. MongoDB
    • 58. “Small” BigData vs. “Big” BigData (diagram labels: Hadoop; Key/Value or Column; Document or Graph; RDBMS)
    • 59. Cloud-hosted RDBMS
        • AWS RDS – SQL Server, MySQL, Oracle
            – Medium cost
            – Solid feature set, i.e. backup, snapshot
            – Use existing tooling
        • Google – MySQL
            – Lowest cost
            – Most limited RDBMS functionality
        • Microsoft – SQL Azure
            – Highest cost
    • 60. DEMO - AWS RDS
        • SQL Server, MySQL or Oracle
        • Essential to understand pricing models
    • 61. Image - http://blog.outsourcing-partners.com/wp-content/uploads/2012/10/performance.png
    • 62. NoSQL Applied
        Columnstore      Log Files             HBase
        Key/Value        Product Catalogs      DynamoDB
        Document         Social Games          MongoDB
        Graph            Social aggregators    Neo4j
        RDBMS            Line-of-Business      SQL Server
    • 63. Cloud Offerings – RDBMS AND NoSQL
                                          AWS                                Google                                 Microsoft
        RDBMS                             RDS – all major                    MySQL                                  SQL Azure
        NoSQL buckets                     S3 or Glacier                      Cloud Storage                          Azure Blobs
        NoSQL Key-Value                   DynamoDB                           H/R Data on GAE                        Azure Tables
        Streaming ML or (Mahout)          Custom EC2                         Prospective Search & Prediction API    StreamInsight
        NoSQL Document or Graph           MongoDB on EC2                     Freebase                               MongoDB on Windows Azure
        NoSQL – Column Hadoop (HBase)     Elastic MapReduce using S3 & EC2   none                                   HDInsight
        Dremel/Warehousing                RedShift                           BigQuery                               none
    • 64. BigData Pipeline – STEP 4 – Query
        Pipeline: Acquire, Process, Store, Query & Mine, Visualize
    • 65. Always MapReduce?
    • 66. Data Scientists and Languages
    • 67. Karmasphere Studio for AWS
    • 68. Can Excel help?
        • Connector to Hadoop
        • Data Explorer
        • Data Quality Services
        • Master Data Services
        • Integration with Azure Data Market
        • Visualize with PowerView
        • Data Mining w/Predixion
    • 69. Demo - Hadoop Connector to Excel
    • 70. Google BigQuery w/Excel
        • Hadoop-like (Dremel) based service
        • For massive amounts of data
        • SQL-like query language
    • 71. DEMO - Google BigQuery
        • Hadoop-like (Dremel) based service
        • For massive amounts of data
        • SQL-like query language
    • 72. Dremel Realized => Impala
        • Interactive Hadoop?
    • 73. Other types of cloud data services
        Hosting public datasets
        • Pay to read data
        • Earn revenue by offering for read
        Cleaning / matching (your)
        • ETL – Microsoft Data Explorer, Google Refine
        • Data Quality – Windows Azure Data Market, InfoChimps, DataMarket.com
    • 74. NoSQL To-Do List
        Understand CAP & types of NoSQL databases
        • Use NoSQL when business needs designate
        • Use the right type of NoSQL for your business problem
        Try out NoSQL on the cloud
        • Quick and cheap for behavioral data
        • Mashup cloud datasets
        • Good for specialized use cases, i.e. dev, test, training environments
        Learn NoSQL access technologies
        • New query languages, i.e. MapReduce, R, Infer.NET
        • New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc…
    • 75. The Changing Data Landscape (diagram labels: RDBMS, NoSQL, Other Services)
    • 76. www.TeachingKidsProgramming.org
        • Free Courseware (recipes)
        • Do a Recipe, Teach a Kid (Ages 10+)
        • Java or Microsoft SmallBasic: TKP site
        • C# via Pluralsight
    • 77. Toward Data Craftsmanship…
        Follow me @LynnLangit
        RSS my blog www.LynnLangit.com
        Hire me
        • To help build your BI/Big Data solution
        • To teach your team next gen BI
        • To learn more about using NoSQL solutions