Big Data and NoSQL for Database and BI Pros

Big Data and NoSQL for Database and BI Pros - PASS Business Analytics Conference 2013


Slide notes and references:
  • http://www.chegg.com/textbooks/foundations-of-sql-server-2008-r2-business-intelligence-2nd-edition-9781430233244-1430233249 and http://www.chegg.com/textbooks/smart-business-intelligence-solutions-with-microsoft-sql-server-2008-1st-edition-9780735625808-0735625808
  • http://www.chantcafe.com/2010/10/reality-in-catholic-music-massive.html
  • BigQuery browser tool: https://developers.google.com/bigquery/docs/browser_tool; Dremel -- http://research.google.com/pubs/pub36632.html
  • http://nosql-database.org/; http://hadoop.apache.org/ & http://www.mongodb.org/; Wikipedia - http://en.wikipedia.org/wiki/NoSQL; list of NoSQL databases - http://nosql-database.org/; the good, the bad - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
  • When the volume of data is too much for simple human interpretation -> man PLUS machine (data mining / statistics)
  • http://bigdatanerd.wordpress.com/2012/01/04/why-nosql-part-2-overview-of-data-modelrelational-nosql/ and http://docs.jboss.org/hibernate/ogm/3.0/reference/en-US/html_single/
  • http://rickosborne.org/download/SQL-to-MongoDB.pdf
  • About data science -- http://www.romymisra.com/the-new-job-market-rulers-data-scientists/; R language - http://www.r-project.org/; Infer.NET - http://research.microsoft.com/en-us/um/cambridge/projects/infernet/. There are a plethora of languages to access, manipulate and process big data. These languages fall into a few categories: RESTful (simple, standards); ETL (Pig, on Hadoop, is an example); query (Hive, again on Hadoop, and lots of *QL); analyze (R, Mahout, Infer.NET, DMX, etc.), applying statistical (data-mining) algorithms to the data output.
  • Big Data and NoSQL for Database and BI Pros

    1. Big Data and NoSQL for Database and BI Pros. Andrew J. Brust, Founder and CEO, Blue Badge Insights
    2. Please silence cell phones
    3. Meet Andrew: CEO and Founder, Blue Badge Insights; Big Data blogger for ZDNet; Microsoft Regional Director, MVP; co-chair of VSLive! and 17 years as a speaker; Founder, Microsoft BI User Group of NYC (http://www.msbinyc.com); co-moderator, NYC .NET Developers Group (http://www.nycdotnetdev.com); "Redmond Review" columnist for Visual Studio Magazine and Redmond Developer News; brustblog.com, Twitter: @andrewbrust
    4. Andrew's New Blog (bit.ly/bigondata)
    5. Lynn Langit (in absentia): CEO and Founder, Lynn Langit Consulting; former Microsoft Evangelist (4 years); Google Developer Expert; MongoDB Master; MCT for 13 years, 7 certifications; Cloudera Certified Developer; MSDN Magazine articles on SQL Azure, Hadoop on Azure and MongoDB on Azure; www.LynnLangit.com, @LynnLangit
    6. Read all about it!
    7. Agenda: Overview / Landscape (Big Data and Hadoop; NoSQL; the Big Data-NoSQL intersection); Drilldown on Big Data; Drilldown on NoSQL
    8. What is Big Data? 100s of TB into PB and higher. Involving data from: financial data, sensors, web logs, social media, etc. Parallel processing often involved. Hadoop is emblematic, but other technologies are Big Data too. Processing of data sets too large for transactional databases. Analyzing interactions, rather than transactions. The three V's: Volume, Velocity, Variety. Big Data tech is sometimes imposed on small data problems.
    9. Big Data = Exponentially More Data. Retail example, the "feedback economy": number of transactions; number of behaviors (collected every minute)
    10. Big Data = "Next State" Questions (collecting behavioral data): What could happen? Why didn't this happen? When will the next new thing happen? What will the next new thing be? What happens?
    11. My Data: An Example from Health Care. Medical records (regular, emergency); genetic data (23andMe); food data (SparkPeople); purchasing (grocery card, credit card); search (Google); social media (Twitter, Facebook); exercise (Nike FuelBand, Kinect); location (phone)
    12. Big Data = More Data
    13. Big Data Considerations: Collection (get the data); Storage (keep the data); Querying (make sense of the data); Visualization (see the business value)
    14. Data Collection. Types of data: structured, semi-structured, unstructured vs. data standards; behavioral vs. transactional data. Methods of collection: sensors everywhere; machine-to-machine; public datasets (Freebase, Azure DataMarket, Hillary Mason's list)
    15. What's MapReduce? Partition the bulk input data and send it to mappers (nodes in the cluster). Mappers pre-process, put data into key-value format, and send all output for a given (set of) key(s) to a reducer. The reducer aggregates: one output per key, with a value. Map and Reduce code is natively written as Java functions.
    16. MapReduce, in a Diagram: (diagram) inputs are split across mappers; mapper output is grouped by key (e.g. K1/K4, K2/K5, K3/K6) and routed to reducers, each of which produces one output per key
    17. A MapReduce Example (a code sketch follows this slide): count by suite, on each floor; send per-suite, per-platform totals to the lobby; sort totals by platform; send two platform packets to the 10th, 20th and 30th floors; tally up each platform; merge tallies into one spreadsheet; collect the tallies
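    The deck notes that Map and Reduce code is natively written in Java, and later (slide 49) that Hadoop Streaming lets you write the same logic in other languages, with Python as a popular option. As a hedged illustration only, here is a minimal word-count mapper and reducer written for Hadoop Streaming; the file names mapper.py and reducer.py are assumptions for the example, not something taken from the deck.

        #!/usr/bin/env python
        # mapper.py -- reads raw text lines from stdin and emits "word<TAB>1" pairs.
        import sys

        for line in sys.stdin:
            for word in line.strip().split():
                print("%s\t%d" % (word.lower(), 1))

        #!/usr/bin/env python
        # reducer.py -- Hadoop Streaming sorts mapper output by key before it reaches
        # the reducer, so all counts for a given word arrive contiguously.
        import sys

        current_word, current_count = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word == current_word:
                current_count += int(count)
            else:
                if current_word is not None:
                    print("%s\t%d" % (current_word, current_count))
                current_word, current_count = word, int(count)
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))

    On a cluster these two scripts would typically be submitted with the streaming JAR the deck mentions later, or tested locally with something like: cat input.txt | python mapper.py | sort | python reducer.py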
    18. What's a Distributed File System? One where data gets distributed over commodity drives on commodity servers. Data is replicated: if one box goes down, no data is lost ("shared nothing"). BUT it is immutable: files can only be written to once, so updates require a drop and re-write (slow); you can append, though. Like a DVD/CD-ROM.
    19. Hadoop = MapReduce + HDFS. Modeled after Google MapReduce + GFS. Have more data? Just add more nodes to the cluster: mappers execute in parallel; hardware is commodity; "scaling out". Use of HDFS means data may well be local to mapper processing, so it's not just parallel, there is also minimal data movement, which avoids network bottlenecks.
    20. Comparison: RDBMS vs. Hadoop
        Data size:           Traditional RDBMS: gigabytes (terabytes)   |  Hadoop/MapReduce: petabytes (exabytes)
        Updates:             Traditional RDBMS: read/write many times   |  Hadoop/MapReduce: write once, read many times
        Integrity:           Traditional RDBMS: high (ACID)             |  Hadoop/MapReduce: low
        Query response time: Traditional RDBMS: can be near immediate   |  Hadoop/MapReduce: has latency (due to batch processing)
    21. Just-in-Time Schema. When looking at unstructured data, schema is imposed at query time. Schema is context specific: if scanning a book, are the values words, lines, or pages? Are notes a single field, or is each word a value? Are date and time two fields or one? Are street, city, state and zip separate or one value? Pig and Hive let you determine this at query time; so does the Map function in MapReduce code (see the sketch after this slide).
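    To make "schema at query time" concrete outside of Pig or Hive, here is a small, hedged Python sketch: the same raw line of text is given two different schemas purely at read time, one treating date and time as a single field and one splitting them. The sample line format is an assumption for illustration only.

        # An assumed raw log line, stored with no declared schema.
        raw = "2013-04-10 09:30:00 andrew checkout 300"

        # Schema decision A, made at query time: date and time are one timestamp field.
        def parse_as_timestamp(line):
            date, time, user, action, amount = line.split()
            return {"timestamp": date + " " + time, "user": user,
                    "action": action, "amount": int(amount)}

        # Schema decision B, same bytes: date and time are two separate fields.
        def parse_as_date_and_time(line):
            date, time, user, action, amount = line.split()
            return {"date": date, "time": time, "user": user,
                    "action": action, "amount": int(amount)}

        print(parse_as_timestamp(raw))
        print(parse_as_date_and_time(raw))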
    22. What's HBase? A wide-column store NoSQL database, modeled after Google BigTable. Uses HDFS, and is therefore Hadoop-compatible. Hadoop MapReduce is often used with HBase, but you can use either without the other.
    23. (image slide)
    24. NoSQL Confusion. Many "flavors" of NoSQL data stores. Easiest to group by functionality, but the dividing lines are not clear or consistent. NoSQL choice(s) are driven by many factors: type of data; quantity of data; knowledge of technical staff; product maturity; tooling.
    25. So much wrong information: everything is "new"; people are religious about data storage; lots of incorrect information; "try" before you "buy" (or use); watch out for oversimplification; confusion over vendor offerings.
    26. Common NoSQL Misconceptions. Problems: everything is "new"; people are religious about data storage; open source is always cheaper; cloud is always cheaper; replace RDBMS with NoSQL. Solutions: "try" before you "buy" (or use); leverage NoSQL communities; add NoSQL to an existing RDBMS solution.
    27. Drilldown on Big Data
    28. The Hadoop Stack: MapReduce, HDFS; database; RDBMS import/export; query (HiveQL and Pig Latin); machine learning / data mining; log file integration
    29. What's Hive? Began as a Hadoop sub-project; now a top-level Apache project. Provides a SQL-like ("HiveQL") abstraction over MapReduce. Has its own HDFS table file format (and it's fully schema-bound). Can also work over HBase. Acts as a bridge to many BI products which expect tabular data.
    30. Hadoop Distributions: Cloudera; Hortonworks (HCatalog: Hive/Pig/MR interop); MapR (a network file system replaces HDFS); IBM InfoSphere BigInsights (HDFS <-> DB2 integration); and now Microsoft...
    31. Microsoft HDInsight. Developed with Hortonworks and incorporates the Hortonworks Data Platform (HDP) for Windows. Windows Azure HDInsight and Microsoft HDInsight (for Windows Server); a single-node preview runs on the Windows client. Includes an ODBC Driver for Hive and a JavaScript MapReduce framework. Contributes it all back to the open source Apache project.
    32. Hortonworks Data Platform for Windows. Amenities for Visual Studio/.NET: MRLib (NuGet package); LINQ to Hive; OdbcClient + Hive ODBC driver; deployment; debugging; MR code in C# (HadoopJob, MapperBase, ReducerBase).
    33. Some ways to work. Microsoft HDInsight: cloud (go to www.windowsazure.com, request a cluster) or local (download Microsoft HDInsight; runs on just about anything, including Windows XP; get it via the Web Platform Installer, WebPI); the local version is free, and the cloud is billed at a 50% discount during preview. Amazon Web Services Elastic MapReduce: create an AWS account; select Elastic MapReduce in the Dashboard; cheap for experimenting, but not free. Cloudera CDH VM image: download as a .tar.gz file; "un-tar" it (can use WinRAR, 7zip); run via VMware Player or VirtualBox; everything's free.
    34. Some ways to work: HDInsight, EMR, CDH 4
    35. Microsoft HDInsight. Much simpler than the others. Browser-based portal: launch MapReduce jobs; on Azure: provisioning the cluster, managing ports, gathering external data. Interactive JavaScript & Hive console: JS for HDFS, Pig and light data visualization; Hive commands and metadata discovery; a new console is coming. Desktop shortcuts: command window, MapReduce, Name Node status in the browser; on Azure, from the portal page you can RDP directly to the Hadoop head node for these desktop shortcuts.
    36. Demo: Windows Azure HDInsight
    37. Amazon Elastic MapReduce. Lots of steps! At a high level: set up an AWS account and S3 "buckets"; generate a key pair and PEM file; install Ruby and the EMR command line interface; provision the cluster using the CLI (a batch file can work very well here); set up and run SSH/PuTTY; work interactively at the command line.
    38. Demo: Amazon Elastic MapReduce
    39. Cloudera CDH4 Virtual Machine. Get it for free, in VMware and VirtualBox versions (VMware Player and VirtualBox are free too). Run it, and configure it to have its own IP on your network; use ifconfig to discover the IP. Assuming an IP of 192.168.1.59, open a browser on your own (host) machine and navigate to http://192.168.1.59:8888. You can also use the browser in the VM and hit http://localhost:8888. Work in "Hue"...
    40. Hue. A browser-based UI, with front ends for HDFS (with upload and download), MapReduce job creation and monitoring, and Hive ("Beeswax"), plus in-browser command line shells for HBase and Pig ("Grunt").
    41. Impala: What it Is. A distributed SQL query engine over the Hadoop cluster. Announced at Strata/Hadoop World in NYC on October 24th. In beta, as part of CDH 4.1. Works with HDFS and Hive data. Compatible with HiveQL and Hive drivers; query with Beeswax.
    42. Impala: What it's Not. Impala is not Hive: Hive converts HiveQL to Java MapReduce code and executes it in batch mode, while Impala executes the query interactively over the data, bringing BI tools and Hadoop closer together. Impala is not an Apache Software Foundation project: though it is open source and Apache-licensed, it is still incubated by Cloudera, and only in CDH.
    43. Demo: Cloudera CDH4, Impala
    44. Hadoop commands. HDFS: hadoop fs <filecommand>; create and remove directories (mkdir, rm, rmr); upload and download files to/from HDFS (get, put); view directory contents (ls, lsr); copy, move and view files (cp, mv, cat). MapReduce: run a Java JAR-file-based job with hadoop jar <jarname> <params>.
    45. Demo: Hadoop (directly)
    46. HBase. Concepts: tables, column families; columns, rows; keys, values. Commands: definition (create, alter, drop, truncate); manipulation (get, put, delete, deleteall, scan); discovery (list, exists, describe, count); enablement (disable, enable); utilities (version, status, shutdown, exit). Reference: http://wiki.apache.org/hadoop/Hbase/Shell. Moreover, interesting HBase work can be done in MapReduce and Pig.
    47. HBase Examples (shell):
        create 't1', 'f1', 'f2', 'f3'
        describe 't1'
        alter 't1', {NAME => 'f1', VERSIONS => 5}
        put 't1', 'r1', 'f1:c1', 'value'
        get 't1', 'r1'
        count 't1'
    48. Demo: HBase
    49. Submitting, Running and Monitoring Jobs. Upload a JAR. Use Streaming: use languages other than Java to write MapReduce code; Python is a popular option; any executable works, even C# console apps; on MS HDInsight, JavaScript works too; it still uses a JAR file (streaming.jar). Run at the command line (passing the JAR name and params) or use a GUI.
    50. Demo: Running MapReduce Jobs
    51. Hive. Used by most BI products which connect to Hadoop. Provides a SQL-like abstraction over Hadoop: officially HiveQL, or HQL. Works on its own tables, but also on HBase. A query generates a MapReduce job, the output of which becomes the result set. Microsoft has a Hive ODBC driver that connects Excel, Reporting Services, PowerPivot and Analysis Services Tabular Mode (only).
    52. Hive, Continued. Load data from flat HDFS files: LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable; SQL-style queries: CREATE, ALTER, DROP; INSERT OVERWRITE (creates whole tables); SELECT, JOIN, WHERE, GROUP BY; SORT BY (but ordering data is tricky!); MAP/REDUCE/TRANSFORM...USING allows for custom map and reduce steps utilizing Java or streaming code.
    53. Data Explorer. A beta add-in for Excel to acquire and transform data. Data sources include Facebook and HDFS. Visually- or script-driven. Also includes the Azure BLOB storage backing up HDInsight.
    54. Demo: Hive, Data Explorer
    55. Pig. Instead of SQL, employs a language ("Pig Latin") that accommodates data flow expressions: do a combination of query and ETL. "10 lines of Pig Latin ≈ 200 lines of Java." Works with structured or unstructured data. Operations: as with Hive, a MapReduce job is generated; unlike Hive, the output is only a flat file to HDFS or text at the command line console; with HDInsight, you can easily convert it to a JavaScript array, then manipulate it. Use the command line ("Grunt") or build scripts.
    56. Example (Pig Latin):
        A = LOAD 'myfile' AS (x, y, z);
        B = FILTER A BY x > 0;
        C = GROUP B BY x;
        D = FOREACH C GENERATE group, COUNT(B);
        STORE D INTO 'output';
    57. Pig Latin Examples. Imperative, file system commands: LOAD, STORE (schema is specified on LOAD). Declarative, query commands (SQL-like), where xxx is a file or data set: FOREACH xxx GENERATE (SELECT...FROM xxx); JOIN (WHERE/INNER JOIN); FILTER xxx BY (WHERE); ORDER xxx BY (ORDER BY); GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*) GROUP BY); DISTINCT (SELECT DISTINCT). Syntax is assignment-statement based: MyCusts = FILTER Custs BY SalesPerson eq 15; Access HBase: CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:', '-loadKey -returnTuple');
    58. Demo: Pig
    59. Sqoop (import):
        sqoop import
          --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>"
          --table <from_table>
          --target-dir <to_hdfs_folder>
          --split-by <from_table_column>
    60. Sqoop (export):
        sqoop export
          --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>"
          --table <to_table>
          --export-dir <from_hdfs_folder>
          --input-fields-terminated-by "<delimiter>"
    61. Flume NG. Sources: Avro (a data serialization system; can read JSON-encoded data files, and can work over RPC); Exec (reads from the stdout of a long-running process). Sinks: HDFS, HBase, Avro. Channels: memory, JDBC, file.
    62. Flume NG (next generation). Set up conf/flume.conf:
        # Define a memory channel called ch1 on agent1
        agent1.channels.ch1.type = memory
        # Define an Avro source called avro-source1 on agent1 and tell it
        # to bind to 0.0.0.0:41414. Connect it to channel ch1.
        agent1.sources.avro-source1.channels = ch1
        agent1.sources.avro-source1.type = avro
        agent1.sources.avro-source1.bind = 0.0.0.0
        agent1.sources.avro-source1.port = 41414
        # Define a logger sink that simply logs all events it receives
        # and connect it to the other end of the same channel.
        agent1.sinks.log-sink1.channel = ch1
        agent1.sinks.log-sink1.type = logger
        # Finally, now that we've defined all of our components, tell
        # agent1 which ones we want to activate.
        agent1.channels = ch1
        agent1.sources = avro-source1
        agent1.sinks = log-sink1
        From the command line: flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
    63. Mahout Algorithms. Recommendation: your info + community info; give users/items/ratings, get user-user/item-item; itemsimilarity. Classification/categorization: drop items into buckets; Naïve Bayes, Complementary Naïve Bayes, Decision Forests. Clustering: like classification, but with the categories unknown; K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift.
    64. Workflow, Syntax. Workflow: run the job; dump the output; visualize, predict. Syntax: mahout <algorithm> --input <folderspec> --output <folderspec> --param1 value1 --param2 value2 ... Example: mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD
    65. The Truth About Mahout. Mahout is really just an algorithm engine; its output is almost unusable by non-statisticians/non-data scientists. You need a staff or a product to visualize it, or make it into a usable prediction model. Investigate Predixion Software: its CTO, Jamie MacLennan, used to lead the SQL Server Data Mining team; its Excel add-in can use Mahout remotely, visualize its output and run predictive analyses; it also integrates with SQL Server, Greenplum and MapReduce; http://www.predixionsoftware.com
    66. The "Data-Refinery" Idea. Use Hadoop to "on-board" unstructured data, then extract manageable subsets. Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them. This is the current rationalization of Hadoop and BI tools' coexistence. Will it stay this way?
    67. Google BigQuery. A Dremel-based service for massive amounts of data. Pay for query and storage. SQL-like query language. Has an Excel connector.
    68. Google BigQuery
    69. Drilldown on NoSQL
    70. NoSQL Data Fodder: addresses, preferences, notes, friends, followers, documents
    71. "Web Scale". This is the term used to justify NoSQL. The scenario is one of simple needs that are "made up for in volume": millions of concurrent users. Think of sites like Amazon or Google, and of non-transactional tasks like loading catalog data to display a product page, or environment preferences.
    72. NoSQL Common Traits: non-relational; non-schematized/schema-free; open source; distributed; eventual consistency; "web scale"; developed at big Internet companies
    73. More than just the elephant in the room: 120+ types of NoSQL databases; so many NoSQL options
    74. Concepts: consistency; CAP theorem; indexing; queries; MapReduce; sharding
    75. Consistency. CAP Theorem: databases may only excel at two of the following three attributes: consistency, availability and partition tolerance. NoSQL does not offer "ACID" guarantees (atomicity, consistency, isolation and durability); instead it offers "eventual consistency", similar to DNS propagation.
    76. Consistency, continued. Things like inventory and account balances should be consistent: imagine updating a server in Seattle that stock was depleted, but not updating the server in NY; a customer in NY goes to order 50 pieces of the item, and the order is processed even though there is no stock. Things like catalog information don't have to be consistent, at least not immediately: if a new item is entered into the catalog, it's OK for some customers to see it even before the other customers' server knows about it. But catalog info must come up quickly, so don't lock data in one location while waiting to update the other. Therefore, it is OK to sacrifice consistency for speed, in some cases.
    77. CAP Theorem (diagram): consistency, availability and partition tolerance, showing where relational and NoSQL databases fall
    78. Indexing. Most NoSQL databases are indexed by key; some allow so-called "secondary" indexes. Often the primary key indexes are clustered. HBase uses HDFS (the Hadoop Distributed File System), which is append-only: writes are logged; logged writes are batched; the file is re-created and sorted.
    79. Queries. Typically there is no query language; instead, you create a procedural program. Sometimes SQL is supported. Sometimes MapReduce code is used...
    80. MapReduce. This is not Hadoop's MapReduce, but it's conceptually related. The Map step pre-processes data; the Reduce step summarizes/aggregates data. Will show a MapReduce code sample for Mongo soon (a hedged sketch follows below); will demo map code on CouchDB.
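    Since the deck defers its Mongo MapReduce sample to a demo, here is a hedged stand-in: a minimal sketch that counts orders per customer using PyMongo with JavaScript map and reduce functions. The collection name orders and the field customer_id are illustrative assumptions, and Collection.map_reduce is the older PyMongo helper (removed in PyMongo 4), so treat this as a sketch rather than the presenters' code.

        from pymongo import MongoClient
        from bson.code import Code

        client = MongoClient("mongodb://localhost:27017/")  # assumes a local MongoDB server
        db = client["store"]                                # hypothetical database name

        # Map emits (customer_id, 1) for every order document; reduce sums the 1s per key.
        mapper = Code("function () { emit(this.customer_id, 1); }")
        reducer = Code("function (key, values) { return Array.sum(values); }")

        # Legacy PyMongo helper (pre-4.0): results land in the 'order_counts' collection.
        result = db.orders.map_reduce(mapper, reducer, "order_counts")
        for doc in result.find():
            print(doc["_id"], doc["value"])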
    81. (image slide)
    82. Sharding. A partitioning pattern where separate servers store the partitions. Fan-out queries are supported. Partitions may be duplicated, so replication is also provided (good for disaster recovery). Since "shards" can be geographically distributed, sharding can act like a CDN. Good for keeping data close to processing: it reduces network traffic when MapReduce splitting takes place.
    83. NoSQL Categories: graph, wide column, document, key/value
    84. Key-Value Stores. The most common, though not necessarily the most popular. Has rows, each with something like a big dictionary/associative array, and the schema may differ from row to row. Common on cloud platforms, e.g. Amazon SimpleDB and Azure Table Storage. MemcacheDB, Voldemort, Couchbase, DynamoDB (AWS), Dynomite, Redis and Riak.
    85. Key-Value Stores (example database; a code sketch follows this slide):
        Table: Customers
          Row ID: 101 -- First_Name: Andrew, Last_Name: Brust, Address: 123 Main Street, Last_Order: 1501
          Row ID: 202 -- First_Name: Jane, Last_Name: Doe, Address: 321 Elm Street, Last_Order: 1502
        Table: Orders
          Row ID: 1501 -- Price: 300 USD, Item1: 52134, Item2: 24457
          Row ID: 1502 -- Price: 2500 GBP, Item1: 98456, Item2: 59428
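    To make the row-as-dictionary idea concrete, here is a small, hedged sketch that stores the Customers and Orders rows above as hashes in Redis (one of the key-value stores the deck lists), using the redis-py client. The key naming scheme customers:<row id> is an assumption, not something prescribed by the deck.

        import redis

        r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis server

        # Each "row" is a hash keyed by customers:<row id>; fields can differ per row (schema-free).
        r.hset("customers:101", "First_Name", "Andrew")
        r.hset("customers:101", "Last_Name", "Brust")
        r.hset("customers:101", "Address", "123 Main Street")
        r.hset("customers:101", "Last_Order", "1501")

        r.hset("orders:1501", "Price", "300 USD")
        r.hset("orders:1501", "Item1", "52134")
        r.hset("orders:1501", "Item2", "24457")

        # Lookup is by key only -- there is no ad hoc query language here.
        print(r.hgetall("customers:101"))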
    86. Wide Column Stores. Tables with declared column families: each column family has "columns", which are key-value pairs that can vary from row to row. These are the most foundational for large sites: BigTable (Google); HBase (originally part of the Yahoo-dominated Hadoop project); Cassandra (Facebook), which calls column families "super columns" and tables "super column families". They are the most "Big Data"-ready, especially HBase + Hadoop.
    87. Wide Column Stores (example):
        Table: Customers
          Row ID: 101
            Super Column: Name -- First_Name: Andrew, Last_Name: Brust
            Super Column: Address -- Number: 123, Street: Main Street
            Super Column: Orders -- Last_Order: 1501
          Row ID: 202
            Super Column: Name -- First_Name: Jane, Last_Name: Doe
            Super Column: Address -- Number: 321, Street: Elm Street
            Super Column: Orders -- Last_Order: 1502
        Table: Orders
          Row ID: 1501
            Super Column: Pricing -- Price: 300 USD
            Super Column: Items -- Item1: 52134, Item2: 24457
          Row ID: 1502
            Super Column: Pricing -- Price: 2500 GBP
            Super Column: Items -- Item1: 98456, Item2: 59428
    88. Demo: Wide Column Stores
    89. Document Stores. Have "databases," which are akin to tables, and "documents," which are akin to rows. Documents are typically JSON objects; each document has properties and values; values can be scalars, arrays, links to documents in other databases, or sub-documents (i.e. contained JSON objects, which allows for hierarchical storage); documents can have attachments as well. Old versions are retained, so document stores work well for content management. Some view document stores as specialized key-value stores. Most popular with developers, startups and VCs. The biggies: CouchDB (and derivatives) and MongoDB.
    90. Document Store Application Orientation. Documents can each be addressed by URIs. CouchDB supports a full REST interface. Very geared towards JavaScript and JSON: documents are JSON objects, and CouchDB/MongoDB use JavaScript as their native language. In CouchDB, "view functions" also have unique URIs and they return HTML, so you can build entire applications in the database.
    91. Document Stores (example; a code sketch follows this slide):
        Database: Customers
          Document ID: 101 -- First_Name: Andrew, Last_Name: Brust, Address: { Number: 123, Street: Main Street }, Orders: { Most_recent: 1501 }
          Document ID: 202 -- First_Name: Jane, Last_Name: Doe, Address: { Number: 321, Street: Elm Street }, Orders: { Most_recent: 1502 }
        Database: Orders
          Document ID: 1501 -- Price: 300 USD, Item1: 52134, Item2: 24457
          Document ID: 1502 -- Price: 2500 GBP, Item1: 98456, Item2: 59428
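    As a hedged sketch of the same example in a real document store, here is roughly what inserting and querying the customer 101 document could look like with MongoDB's Python driver (PyMongo). The database and collection names are assumptions drawn from the diagram, not code from the deck.

        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017/")  # assumes a local mongod
        db = client["customers_db"]                         # hypothetical database name

        # A document is just a JSON-like object; sub-documents give hierarchical storage.
        db.customers.insert_one({
            "_id": 101,
            "First_Name": "Andrew",
            "Last_Name": "Brust",
            "Address": {"Number": 123, "Street": "Main Street"},
            "Orders": {"Most_recent": 1501},
        })

        # Query by any property, including properties of sub-documents.
        doc = db.customers.find_one({"Address.Street": "Main Street"})
        print(doc["First_Name"], doc["Orders"]["Most_recent"])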
    92. Comparing...
    93. Demo: Document Stores
    94. Graph Databases. Great for social network applications and others where relationships are important. Nodes and edges: an edge is like a join; nodes are like rows in a table. Nodes can also have properties and values. Neo4j is a popular graph database.
    95. Graph Databases (example diagram; a code sketch follows this slide): nodes for people (Andrew Brust, Jane Doe, Joe Smith, George Washington), an address (Street: 123 Main Street, City: New York, State: NY, Zip: 10014), an order (ID: 252, Total Price: 300 USD) and items (ID: 52134, a blue dress; ID: 24457, a red shirt), connected by edges such as "friend of", "sent invitation to", "commented on photo by", "placed order", "address", "Item1" and "Item2"
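    To illustrate the nodes-with-properties and labeled-edges idea without tying it to any one product's API (the deck does not show Neo4j code), here is a tiny, hedged sketch in plain Python that models part of the diagram above as an adjacency structure. It is purely illustrative, not how a real graph database stores or queries data.

        # Nodes: id -> properties; edges: (from_id, relationship, to_id).
        nodes = {
            "andrew":    {"type": "person", "name": "Andrew Brust"},
            "jane":      {"type": "person", "name": "Jane Doe"},
            "order252":  {"type": "order", "total_price": "300 USD"},
            "item52134": {"type": "item", "kind": "Dress", "color": "Blue"},
            "item24457": {"type": "item", "kind": "Shirt", "color": "Red"},
        }
        edges = [
            ("jane", "friend_of", "andrew"),
            ("andrew", "placed_order", "order252"),
            ("order252", "item1", "item52134"),
            ("order252", "item2", "item24457"),
        ]

        def neighbors(node_id, relationship):
            """Follow edges of one relationship type from a node (a traversal, like a join)."""
            return [dst for src, rel, dst in edges if src == node_id and rel == relationship]

        # Which items are on the order Andrew placed?
        for order in neighbors("andrew", "placed_order"):
            items = neighbors(order, "item1") + neighbors(order, "item2")
            print(nodes[order], [nodes[i] for i in items])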
    96. NoSQL on Windows Azure. Platform as a Service: Cloudant (https://cloudant.com/azure/); MongoDB via MongoLab (http://blog.mongolab.com/2012/10/azure/). MongoDB, DIY: on an Azure Worker Role (http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles); on a Windows VM (http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer); on a Linux VM (http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial and http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/).
    97. NoSQL on Windows Azure, continued. Others, DIY (Linux VMs): Couchbase (http://blog.couchbase.com/couchbase-server-new-windows-azure); CouchDB (http://ossonazure.interoperabilitybridges.com/articles/couchdb-installer-for-windows-azure); Riak (http://basho.com/blog/technical/2012/10/09/Riak-on-Microsoft-Azure/); Redis (http://blogs.msdn.com/b/tconte/archive/2012/06/08/running-redis-on-a-centos-linux-vm-in-windows-azure.aspx); Cassandra (http://www.windowsazure.com/en-us/manage/linux/other-resources/how-to-run-cassandra-with-linux/).
    98. NoSQL + BI. NoSQL databases are bad for ad hoc query and data warehousing. BI applications involve models, and models rely on schema. Extract, transform and load (ETL) may be your friend. Wide-column stores, however, are good for "Big Data" (see the next slide). Wide-column stores and column-oriented databases are similar technologically.
    99. NoSQL + Big Data. Big Data and NoSQL are interrelated. Typically, wide-column stores are used in Big Data scenarios; the prime example is HBase and Hadoop. Why? Lack of indexing is not a problem; consistency is not an issue; fast reads are very important; distributed file systems are important too; commodity hardware and disk assumptions also matter; it's not Web scale but massive scale-out, so the concerns are similar.
    100. NoSQL Compromises: eventual consistency; write buffering; only primary keys can be indexed; queries must be written as programs; tooling and productivity (= money)
    101. Common DBA Tasks in NoSQL
        RDBMS                           NoSQL
        Import data                     Import data
        Set up security                 Set up security
        Perform a backup                Make a copy of the data
        Restore a database              Move a copy to a location
        Create an index                 Create an index
        Join tables together            Run MapReduce
        Schedule a job                  Schedule a (cron) job
        Run database maintenance        Monitor space and resources used
        Send an email from SQL Server   Set up resource threshold alerts
        Search BOL                      Interpret documentation
    102. Which Type of NoSQL for Which Type of Data?
        Type of data                Type of NoSQL solution    Example
        Log files                   Wide column               HBase
        Product catalogs            Key-value on disk         DynamoDB
        User profiles               Key-value in memory       Redis
        Startups                    Document                  MongoDB
        Social media connections    Graph                     Neo4j
        LOB with transactions       None! Use an RDBMS        SQL Server
    103. Relational vs. NoSQL: line of business -> relational; large, public (consumer)-facing sites -> NoSQL; complex data structures -> relational; Big Data -> NoSQL; transactional -> relational; content management -> NoSQL; enterprise -> relational; consumer Web -> NoSQL
    104. Data Scientists...
    105. NoSQL To-Do List. Understand CAP and the types of NoSQL databases: use NoSQL when business needs designate; use the right type of NoSQL for your business problem. Try out NoSQL in the cloud: quick and cheap for behavioral data; mash up cloud datasets; good for specialized use cases, i.e. dev, test and training environments. Learn NoSQL access technologies: new query languages, i.e. MapReduce, R, Infer.NET; new query tools (vendor-specific): Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.
    106. NoSQL for .NET Developers: RavenDB; MongoDB C#/.NET driver; MongoDB on Windows Azure; Couchbase .NET client library; Riak client for .NET; AWS Toolkit for Visual Studio; Google cloud APIs (REST-based)
    107. Thank You. andrew.brust@bluebadgeinsights.com; @andrewbrust on Twitter. Want to get on Blue Badge Insights' list? Text "bluebadge" to 22828.
    108. Thank you! Diamond Sponsor
