Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros - PASS Business Analytics Conference 2013

Published in: Technology
  • http://www.chegg.com/textbooks/foundations-of-sql-server-2008-r2-business-intelligence-2nd-edition-9781430233244-1430233249
    http://www.chegg.com/textbooks/smart-business-intelligence-solutions-with-microsoft-sql-server-2008-1st-edition-9780735625808-0735625808
  • http://www.chantcafe.com/2010/10/reality-in-catholic-music-massive.html
  • https://developers.google.com/bigquery/docs/browser_tool
    Dremel -- http://research.google.com/pubs/pub36632.html
  • http://nosql-database.org/
    http://hadoop.apache.org/ & http://www.mongodb.org/
    Wikipedia - http://en.wikipedia.org/wiki/NoSQL
    List of NoSQL databases – http://nosql-database.org/
    The good, the bad - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
  • When the volume of data is too much for simple human interpretation -> Man PLUS Machine (Data Mining / Statistics)
  • http://bigdatanerd.wordpress.com/2012/01/04/why-nosql-part-2-overview-of-data-modelrelational-nosql/
    http://docs.jboss.org/hibernate/ogm/3.0/reference/en-US/html_single/
  • http://rickosborne.org/download/SQL-to-MongoDB.pdf
  • About Data Science -- http://www.romymisra.com/the-new-job-market-rulers-data-scientists/
    R language - http://www.r-project.org/
    Infer.NET - http://research.microsoft.com/en-us/um/cambridge/projects/infernet/
    There are a plethora of languages to access, manipulate and process Big Data. These languages fall into a couple of categories:
      • RESTful – simple, standards
      • ETL – Pig (Hadoop) is an example
      • Query – Hive (again Hadoop), lots of *QL
      • Analyze – R, Mahout, Infer.NET, DMX, etc. Applying statistical (data-mining) algorithms to the data output
  • Transcript

    • 1. Big Data and NoSQL for Database and BI Pros
        • April 10-12 | Chicago, IL
        • Andrew J. Brust, Founder and CEO, Blue Badge Insights
    • 2. Please silence cell phones
    • 3. Meet Andrew
        • CEO and Founder, Blue Badge Insights
        • Big Data blogger for ZDNet
        • Microsoft Regional Director, MVP
        • Co-chair VSLive! and 17 years as a speaker
        • Founder, Microsoft BI User Group of NYC
            • http://www.msbinyc.com
        • Co-moderator, NYC .NET Developers Group
            • http://www.nycdotnetdev.com
        • “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
        • brustblog.com, Twitter: @andrewbrust
    • 4. Andrew's New Blog (bit.ly/bigondata)
    • 5. Lynn Langit (in absentia)
        • CEO and Founder, Lynn Langit consulting
        • Former Microsoft Evangelist (4 years)
        • Google Developer Expert
        • MongoDB Master
        • MCT 13 years – 7 certifications
        • Cloudera Certified Developer
        • MSDN Magazine articles
            • SQL Azure
            • Hadoop on Azure
            • MongoDB on Azure
        • www.LynnLangit.com
        • @LynnLangit
    • 6. Read all about it!
    • 7. Agenda
        • Overview / Landscape
            • Big Data, and Hadoop
            • NoSQL
            • The Big Data-NoSQL Intersection
        • Drilldown on Big Data
        • Drilldown on NoSQL
    • 8. What is Big Data?
        • 100s of TB into PB and higher
        • Involving data from: financial data, sensors, web logs, social media, etc.
        • Parallel processing often involved
        • Hadoop is emblematic, but other technologies are Big Data too
        • Processing of data sets too large for transactional databases
        • Analyzing interactions, rather than transactions
        • The three V's: Volume, Velocity, Variety
        • Big Data tech sometimes imposed on small data problems
    • 9. Big Data = Exponentially More Data
        • Retail Example -> 'Feedback Economy'
            • Number of transactions
            • Number of behaviors (collected every minute)
    • 10. Big Data = 'Next State' Questions
        • What could happen?
        • Why didn't this happen?
        • When will the next new thing happen?
        • What will the next new thing be?
        • What happens?
        • Collecting behavioral data
    • 11. My Data: An Example from Health Care
        • Medical records
            • Regular
            • Emergency
            • Genetic data – 23andMe
        • Food data
            • SparkPeople
        • Purchasing
            • Grocery card
            • Credit card
        • Search – Google
        • Social media
            • Twitter
            • Facebook
        • Exercise
            • Nike Fuel Band
            • Kinect
            • Location - phone
    • 12. Big Data = More Data
    • 13. Big Data Considerations
        • Collection – get the data
        • Storage – keep the data
        • Querying – make sense of the data
        • Visualization – see the business value
    • 14. Data Collection
        • Types of Data
            • Structured, semi-structured, unstructured vs. data standards
            • Behavioral vs. transactional data
        • Methods of collection
            • Sensors everywhere
            • Machine-2-Machine
            • Public Datasets
                • Freebase
                • Azure DataMarket
                • Hilary Mason's list
    • 15. What's MapReduce?
        • Partition the bulk input data and send to mappers (nodes in cluster)
        • Mappers pre-process, put into key-value format, and send all output for a given (set of) key(s) to a reducer
        • Reducer aggregates; one output per key, with value
        • Map and Reduce code natively written as Java functions
    • 16. MapReduce, in a Diagram
        • [Diagram: input splits flow to six mappers; mapper output is routed by key to three reducers (keys K1, K4; K2, K5; K3, K6), each of which writes output]
    • 17. A MapReduce Example
        • Count by suite, on each floor
        • Send per-suite, per platform totals to lobby
        • Sort totals by platform
        • Send two platform packets to 10th, 20th, 30th floor
        • Tally up each platform
        • Merge tallies into one spreadsheet
        • Collect the tallies
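The partition, map, group-by-key and reduce steps described on slides 15 to 17 can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Hadoop code: the `records` data, the per-floor partitioning and the function names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical input: (floor, suite, phone platform) for each person surveyed.
records = [
    (1, "101", "iOS"), (1, "101", "Android"), (1, "102", "iOS"),
    (2, "201", "Android"), (2, "201", "Android"), (2, "202", "Windows Phone"),
]

def map_phase(partition):
    """Mapper: emit one (key, value) pair per record; the key is the platform."""
    for _floor, _suite, platform in partition:
        yield platform, 1

def shuffle(pairs):
    """Group all values for the same key together, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: one output per key, with the aggregated value."""
    return key, sum(values)

# Partition the input by floor, so each "mapper" handles one floor.
partitions = defaultdict(list)
for record in records:
    partitions[record[0]].append(record)

mapped = [pair for part in partitions.values() for pair in map_phase(part)]
totals = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(totals)
```

In a real cluster the mappers run on different nodes and the shuffle moves data over the network; here everything happens in one process so the flow is easy to follow.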
    • 18. What's a Distributed File System?
        • One where data gets distributed over commodity drives on commodity servers
        • Data is replicated
            • If one box goes down, no data lost
            • “Shared Nothing”
        • BUT: Immutable
            • Files can only be written to once
            • So updates require drop + re-write (slow)
            • You can append though
            • Like a DVD/CD-ROM
    • 19. Hadoop = MapReduce + HDFS
        • Modeled after Google MapReduce + GFS
        • Have more data? Just add more nodes to cluster.
            • Mappers execute in parallel
            • Hardware is commodity
            • “Scaling out”
        • Use of HDFS means data may well be local to mapper processing
            • So, not just parallel, but minimal data movement, which avoids network bottlenecks
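    • 20. Comparison: RDBMS vs. Hadoop
        • Data Size: Traditional RDBMS -> Gigabytes (Terabytes); Hadoop / MapReduce -> Petabytes (Exabytes)
        • Updates: Traditional RDBMS -> Read / Write many times; Hadoop / MapReduce -> Write once, Read many times
        • Integrity: Traditional RDBMS -> High (ACID); Hadoop / MapReduce -> Low
        • Query Response Time: Traditional RDBMS -> Can be near immediate; Hadoop / MapReduce -> Has latency (due to batch processing)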
    • 21. Just-in-Time Schema
        • When looking at unstructured data, schema is imposed at query time
        • Schema is context specific
            • If scanning a book, are the values words, lines, or pages?
            • Are notes a single field, or is each word a value?
            • Are date and time two fields or one?
            • Are street, city, state, zip separate or one value?
            • Pig and Hive let you determine this at query time
            • So does the Map function in MapReduce code
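To make "schema is imposed at query time" concrete, here is a small sketch in which the same raw text is read three different ways. The sample text and the three reader functions are hypothetical; nothing about the stored bytes changes between readings.

```python
# Hypothetical raw text, stored with no schema at all.
raw = "2013-04-10 09:15 Chicago IL\n2013-04-11 14:30 New York NY\n"

def as_words(text):
    """Treat every whitespace-separated token as a value."""
    return text.split()

def as_lines(text):
    """Treat every line as a single value."""
    return text.splitlines()

def as_events(text):
    """Impose a (date, time, place) schema; date and time stay separate fields."""
    events = []
    for line in text.splitlines():
        date, time, place = line.split(" ", 2)
        events.append({"date": date, "time": time, "place": place})
    return events

print(len(as_words(raw)), len(as_lines(raw)), as_events(raw)[1]["place"])
```

Each function plays the role a Pig `LOAD ... AS` clause, a Hive table definition or a Map function would play: the choice of fields belongs to the query, not to the file.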
    • 22. What's HBase?
        • A Wide-Column Store NoSQL database
        • Modeled after Google BigTable
        • Uses HDFS
        • Therefore, Hadoop-compatible
        • Hadoop MapReduce often used with HBase
        • But you can use either without the other
    • 23.
    • 24. NoSQL Confusion
        • Many 'flavors' of NoSQL data stores
        • Easiest to group by functionality, but…
            • Dividing lines are not clear or consistent
        • NoSQL choice(s) driven by many factors
            • Type of data
            • Quantity of data
            • Knowledge of technical staff
            • Product maturity
            • Tooling
    • 25. So much wrong information
        • Everything is 'new'
        • People are religious about data storage
        • Lots of incorrect information
        • 'Try' before you 'buy' (or use)
        • Watch out for oversimplification
        • Confusion over vendor offerings
    • 26. Common NoSQL Misconceptions
        • Problems
            • Everything is 'new'
            • People are religious about data storage
            • Open source is always cheaper
            • Cloud is always cheaper
            • Replace RDBMS with NoSQL
        • Solutions
            • 'Try' before you 'buy' (or use)
            • Leverage NoSQL communities
            • Add NoSQL to existing RDBMS solution
    • 27. Drilldown on Big Data
    • 28. The Hadoop Stack
        • MapReduce, HDFS
        • Database
        • RDBMS Import/Export
        • Query: HiveQL and Pig Latin
        • Machine Learning/Data Mining
        • Log file integration
    • 29. What's Hive?
        • Began as Hadoop sub-project
        • Now top-level Apache project
        • Provides a SQL-like (“HiveQL”) abstraction over MapReduce
        • Has its own HDFS table file format (and it's fully schema-bound)
        • Can also work over HBase
        • Acts as a bridge to many BI products which expect tabular data
    • 30. Hadoop Distributions
        • Cloudera
        • Hortonworks
            • HCatalog: Hive/Pig/MR Interop
        • MapR
            • Network File System replaces HDFS
        • IBM InfoSphere BigInsights
            • HDFS<->DB2 integration
        • And now Microsoft…
    • 31. Microsoft HDInsight
        • Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows
        • Windows Azure HDInsight and Microsoft HDInsight (for Windows Server)
            • Single node preview runs on Windows client
        • Includes ODBC Driver for Hive
        • JavaScript MapReduce framework
        • Contribute it all back to open source Apache Project
    • 32. Amenities for Visual Studio/.NET
        • Hortonworks Data Platform for Windows
        • MRLib (NuGet Package)
        • LINQ to Hive
        • OdbcClient + Hive ODBC Driver
        • Deployment
        • Debugging
        • MR code in C#, HadoopJob, MapperBase, ReducerBase
    • 33. Some ways to work
        • Microsoft HDInsight
            • Cloud: go to www.windowsazure.com, request a cluster
            • Local: Download Microsoft HDInsight
            • Runs on just about anything, including Windows XP
            • Get it via the Web Platform installer (WebPI)
            • Local version is free; cloud billed at 50% discount during preview
        • Amazon Web Services Elastic MapReduce
            • Create AWS account
            • Select Elastic MapReduce in Dashboard
            • Cheap for experimenting, but not free
        • Cloudera CDH VM image
            • Download as .tar.gz file
            • “Un-tar” (can use WinRAR, 7zip)
            • Run via VMWare Player or Virtual Box
            • Everything’s free
    • 34. Some ways to work
        • HDInsight
        • EMR
        • CDH 4
    • 35. Microsoft HDInsight
        • Much simpler than the others
        • Browser-based portal
            • Launch MapReduce jobs
            • Azure: Provisioning cluster, managing ports, gather external data
        • Interactive JavaScript & Hive console
            • JS: HDFS, Pig, light data visualization
            • Hive commands and metadata discovery
            • New console coming
        • Desktop Shortcuts:
            • Command window, MapReduce, Name Node status in browser
            • Azure: from portal page you can RDP directly to Hadoop head node for these desktop shortcuts
    • 36. Demo: Windows Azure HDInsight
    • 37. Amazon Elastic MapReduce
        • Lots of steps!
        • At a high level:
            • Setup AWS account and S3 “buckets”
            • Generate Key Pair and PEM file
            • Install Ruby and EMR Command Line Interface
            • Provision the cluster using CLI
                • A batch file can work very well here
            • Setup and run SSH/PuTTY
            • Work interactively at command line
    • 38. Demo: Amazon Elastic MapReduce
    • 39. Cloudera CDH4 Virtual Machine
        • Get it for free, in VMWare and Virtual Box versions.
            • VMWare player and Virtual Box are free too
        • Run it, and configure it to have its own IP on your network. Use ifconfig to discover IP.
        • Assuming IP of 192.168.1.59, open browser on your own (host) machine and navigate to:
            • http://192.168.1.59:8888
        • Can also use browser in VM and hit:
            • http://localhost:8888
        • Work in “Hue”…
    • 40. Hue
        • Browser based UI, with frontends for:
            • HDFS (w/ upload & download)
            • MapReduce job creation and monitoring
            • Hive (“Beeswax”)
        • And in-browser command line shells for:
            • HBase
            • Pig (“Grunt”)
    • 41. Impala: What it Is
        • Distributed SQL query engine over Hadoop cluster
        • Announced at Strata/Hadoop World in NYC on October 24th
        • In Beta, as part of CDH 4.1
        • Works with HDFS and Hive data
        • Compatible with HiveQL and Hive drivers
            • Query with Beeswax
    • 42. Impala: What it's Not
        • Impala is not Hive
            • Hive converts HiveQL to Java MapReduce code and executes it in batch mode
            • Impala executes query interactively over the data
            • Brings BI tools and Hadoop closer together
        • Impala is not an Apache Software Foundation project
            • It is open source and Apache-licensed, but it's still incubated by Cloudera
            • Only in CDH
    • 43. Demo: Cloudera CDH4, Impala
    • 44. Hadoop commands
        • HDFS
            • hadoop fs filecommand
            • Create and remove directories: mkdir, rm, rmr
            • Upload and download files to/from HDFS: get, put
            • View directory contents: ls, lsr
            • Copy, move, view files: cp, mv, cat
        • MapReduce
            • Run a Java jar-file based job: hadoop jar jarname params
    • 45. Demo: Hadoop (directly)
    • 46. HBase
        • Concepts:
            • Tables, column families
            • Columns, rows
            • Keys, values
        • Commands:
            • Definition: create, alter, drop, truncate
            • Manipulation: get, put, delete, deleteall, scan
            • Discovery: list, exists, describe, count
            • Enablement: disable, enable
            • Utilities: version, status, shutdown, exit
            • Reference: http://wiki.apache.org/hadoop/Hbase/Shell
        • Moreover,
            • Interesting HBase work can be done in MapReduce, Pig
    • 47. HBase Examples
        • create 't1', 'f1', 'f2', 'f3'
        • describe 't1'
        • alter 't1', {NAME => 'f1', VERSIONS => 5}
        • put 't1', 'r1', 'f1:c1', 'value'
        • get 't1', 'r1'
        • count 't1'
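The shell commands on slide 47 are easier to read with the data model in mind: a table holds rows, each row holds cells addressed as family:qualifier, and each cell can keep several versions. The toy class below mimics that model in memory. It is not the HBase client API; `ToyTable` and its methods are invented for illustration, and the version limit is applied table-wide for brevity, whereas the slide's `alter` sets it per column family.

```python
from collections import defaultdict

class ToyTable:
    """In-memory stand-in for an HBase table (hypothetical, not the HBase API)."""

    def __init__(self, name, families, versions=1):
        self.name = name
        self.families = set(families)
        self.versions = versions
        # row key -> "family:qualifier" -> list of values, newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = self.rows[row][column]
        cell.insert(0, value)
        del cell[self.versions:]  # keep only the newest N versions

    def get(self, row):
        """Return the newest value of every column in the row."""
        return {col: vals[0] for col, vals in self.rows.get(row, {}).items()}

    def count(self):
        return len(self.rows)

t1 = ToyTable("t1", ["f1", "f2", "f3"], versions=5)
t1.put("r1", "f1:c1", "value")
t1.put("r1", "f1:c1", "newer value")
print(t1.get("r1"), t1.count())
```

Writing to a column whose family was never declared fails, while new qualifiers inside a declared family can appear at any time. That is the sense in which columns "can vary from row to row".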
    • 48. Demo: HBase
    • 49. Submitting, Running and Monitoring Jobs
        • Upload a JAR
        • Use Streaming
            • Use other languages (i.e. other than Java) to write MapReduce code
            • Python is popular option
            • Any executable works, even C# console apps
            • On MS HDInsight, JavaScript works too
            • Still uses a JAR file: streaming.jar
        • Run at command line (passing JAR name and params) or use GUI
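As a sketch of what Streaming code in Python might look like: a streaming mapper and reducer exchange plain text lines, conventionally `key<TAB>value`, and the reducer receives its input sorted by key. In a real job each function would live in its own script reading `sys.stdin` and printing to stdout; here both are ordinary generators and the sort stands in for the framework, so the word-count task and the `sample` lines are assumptions for illustration.

```python
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum counts per key; input must arrive sorted by key, as Hadoop delivers it."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the framework: map, sort by key, reduce.
sample = ["big data and NoSQL", "Big Data for BI pros"]
result = list(reducer(sorted(mapper(sample))))
print(result)
```

Because the contract is only "lines in, lines out", the same pattern works for any executable, which is why the slide can list C# console apps and JavaScript alongside Python.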
    • 50. Demo: Running MapReduce Jobs
    • 51. Hive
        • Used by most BI products which connect to Hadoop
        • Provides a SQL-like abstraction over Hadoop
            • Officially HiveQL, or HQL
        • Works on own tables, but also on HBase
        • Query generates MapReduce job, output of which becomes result set
        • Microsoft has Hive ODBC driver
            • Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
    • 52. Hive, Continued
        • Load data from flat HDFS files
            • LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable;
        • SQL Queries
            • CREATE, ALTER, DROP
            • INSERT OVERWRITE (creates whole tables)
            • SELECT, JOIN, WHERE, GROUP BY
            • SORT BY, but ordering data is tricky!
            • MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps utilizing Java or streaming code
    • 53. Data Explorer
        • Beta add-in for Excel
        • Acquire, transform data
        • Data sources include Facebook, HDFS
        • Visually- or script-driven
        • Also includes Azure BLOB storage backing up HDInsight
    • 54. Demo: Hive, Data Explorer
    • 55. Pig
        • Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow expressions
            • Do a combo of Query and ETL
        • “10 lines of Pig Latin ≈ 200 lines of Java.”
        • Works with structured or unstructured data
        • Operations
            • As with Hive, a MapReduce job is generated
            • Unlike Hive, output is only flat file to HDFS or text at command line console
            • With HDInsight, can easily convert to JavaScript array, then manipulate
        • Use command line (“Grunt”) or build scripts
    • 56. Example
        • A = LOAD 'myfile' AS (x, y, z);
        • B = FILTER A BY x > 0;
        • C = GROUP B BY x;
        • D = FOREACH C GENERATE group, COUNT(B);
        • STORE D INTO 'output';
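For readers who do not know Pig Latin, the same data flow (load, filter on x > 0, group by x, count per group) can be traced in plain Python. The rows in `A` are hypothetical stand-ins for the contents of the input file, and the variable names mirror the Pig aliases only to make the correspondence visible.

```python
from collections import Counter

# Hypothetical rows standing in for the (x, y, z) tuples loaded from the input file.
A = [(1, "a", 10), (0, "b", 20), (2, "c", 30), (1, "d", 40), (-3, "e", 50)]

B = [row for row in A if row[0] > 0]   # FILTER ... BY x > 0
C = Counter(row[0] for row in B)       # GROUP ... BY x, counting rows per group
D = sorted(C.items())                  # one (group key, count) pair per group

print(D)
```

Pig compiles these steps into MapReduce jobs, so the filter runs in the mappers and the counting in the reducers; the Python version just runs them in sequence.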
    • 57. Pig Latin Examples
        • Imperative, file system commands
            • LOAD, STORE
            • Schema specified on LOAD
        • Declarative, query commands (SQL-like)
            • xxx = file or data set
            • FOREACH xxx GENERATE (SELECT…FROM xxx)
            • JOIN (WHERE/INNER JOIN)
            • FILTER xxx BY (WHERE)
            • ORDER xxx BY (ORDER BY)
            • GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*) GROUP BY)
            • DISTINCT (SELECT DISTINCT)
        • Syntax is assignment statement-based:
            • MyCusts = FILTER Custs BY SalesPerson eq 15;
        • Access HBase
            • CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:', '-loadKey -returnTuple');
    • 58. Demo: Pig
    • 59. Sqoop
            sqoop import \
                --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" \
                --table <from_table> \
                --target-dir <to_hdfs_folder> \
                --split-by <from_table_column>
    • 60. Sqoop
            sqoop export \
                --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" \
                --table <to_table> \
                --export-dir <from_hdfs_folder> \
                --input-fields-terminated-by "<delimiter>"
    • 61. Flume NG
        • Source
            • Avro (data serialization system – can read json-encoded data files, and can work over RPC)
            • Exec (reads from stdout of long-running process)
        • Sinks
            • HDFS, HBase, Avro
        • Channels
            • Memory, JDBC, file
    • 62. Flume NG (next generation)
        • Setup conf/flume.conf
            # Define a memory channel called ch1 on agent1
            agent1.channels.ch1.type = memory

            # Define an Avro source called avro-source1 on agent1 and tell it
            # to bind to 0.0.0.0:41414. Connect it to channel ch1.
            agent1.sources.avro-source1.channels = ch1
            agent1.sources.avro-source1.type = avro
            agent1.sources.avro-source1.bind = 0.0.0.0
            agent1.sources.avro-source1.port = 41414

            # Define a logger sink that simply logs all events it receives
            # and connect it to the other end of the same channel.
            agent1.sinks.log-sink1.channel = ch1
            agent1.sinks.log-sink1.type = logger

            # Finally, now that we've defined all of our components, tell
            # agent1 which ones we want to activate.
            agent1.channels = ch1
            agent1.sources = avro-source1
            agent1.sinks = log-sink1
        • From the command line:
            flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
    • 63. Mahout Algorithms
        • Recommendation
            • Your info + community info
            • Give users/items/ratings; get user-user/item-item
            • itemsimilarity
        • Classification/Categorization
            • Drop into buckets
            • Naïve Bayes, Complementary Naïve Bayes, Decision Forests
        • Clustering
            • Like classification, but with categories unknown
            • K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
    • 64. Workflow, Syntax
        • Workflow
            • Run the job
            • Dump the output
            • Visualize, predict
        • Syntax
            mahout algorithm --input folderspec --output folderspec --param1 value1 --param2 value2 …
        • Example:
            mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD
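To give a feel for what an item-item similarity job computes from users/items/ratings input, the sketch below scores every pair of items that share at least one rater. It uses cosine similarity because it is short to write; it is not Mahout's implementation and not the log-likelihood measure named in the slide's example command. The `ratings` triples are made up.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical (user, item, rating) triples.
ratings = [
    ("ann", "book", 5.0), ("ann", "film", 3.0),
    ("bob", "book", 4.0), ("bob", "film", 2.0), ("bob", "game", 5.0),
    ("cat", "film", 4.0), ("cat", "game", 4.0),
]

def item_similarity(triples):
    """Return {(item_a, item_b): cosine similarity} for items with a shared rater."""
    by_item = defaultdict(dict)
    for user, item, rating in triples:
        by_item[item][user] = rating
    scores = {}
    for a, b in combinations(sorted(by_item), 2):
        shared = by_item[a].keys() & by_item[b].keys()
        if not shared:
            continue
        dot = sum(by_item[a][u] * by_item[b][u] for u in shared)
        norm_a = sqrt(sum(r * r for r in by_item[a].values()))
        norm_b = sqrt(sum(r * r for r in by_item[b].values()))
        scores[(a, b)] = dot / (norm_a * norm_b)
    return scores

scores = item_similarity(ratings)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

The output is a table of item pairs and numbers, which illustrates the point made on the next slide: turning scores like these into something a business user can act on takes further work.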
    • 65. The Truth About Mahout
        • Mahout is really just an algorithm engine
        • Its output is almost unusable by non-statisticians/non-data scientists
        • You need a staff or a product to visualize, or make into a usable prediction model
        • Investigate Predixion Software
            • CTO, Jamie MacLennan, used to lead SQL Server Data Mining team
            • Excel add-in can use Mahout remotely, visualize its output, run predictive analyses
            • Also integrates with SQL Server, Greenplum, MapReduce
            • http://www.predixionsoftware.com
    • 66. The “Data-Refinery” Idea
        • Use Hadoop to “on-board” unstructured data, then extract manageable subsets
        • Load the subsets into conventional DW/BI servers and use familiar analytics tool to examine
        • This is the current rationalization of Hadoop + BI tools' coexistence
        • Will it stay this way?
    • 67. Google BigQuery
        • Dremel-based service for massive amounts of data
        • Pay for query and storage
        • SQL-like query language
        • Has an Excel connector
    • 68. Google BigQuery
    • 69. Drilldown on NoSQL
    • 70. NoSQL Data Fodder
        • Addresses
        • Preferences
        • Notes
        • Friends, Followers
        • Documents
    • 71. “Web Scale”
        • This is the term used to justify NoSQL
        • Scenario is simple needs but “made up for in volume”
            • Millions of concurrent users
        • Think of sites like Amazon or Google
        • Think of non-transactional tasks like loading catalog data to display product page, or environment preferences
    • 72. NoSQL Common Traits
        • Non-relational
        • Non-schematized/schema-free
        • Open source
        • Distributed
        • Eventual consistency
        • “Web scale”
        • Developed at big Internet companies
    • 73. So many NoSQL options
        • More than just the Elephant in the room
        • Over 120+ types of NoSQL databases
    • 74. Concepts
        • Consistency
        • CAP Theorem
        • Indexing
        • Queries
        • MapReduce
        • Sharding
    • 75. Consistency
        • CAP Theorem
            • Databases may only excel at two of the following three attributes: consistency, availability and partition tolerance
        • NoSQL does not offer “ACID” guarantees
            • Atomicity, consistency, isolation and durability
        • Instead offers “eventual consistency”
        • Similar to DNS propagation
    • 76. Consistency
        • Things like inventory, account balances should be consistent
            • Imagine updating a server in Seattle that stock was depleted
            • Imagine not updating the server in NY
            • Customer in NY goes to order 50 pieces of the item
            • Order processed even though no stock
        • Things like catalog information don't have to be, at least not immediately
            • If a new item is entered into the catalog, it's OK for some customers to see it even before the other customers' server knows about it
        • But catalog info must come up quickly
            • Therefore don't lock data in one location while waiting to update the other
        • Therefore, OK to sacrifice consistency for speed, in some cases
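The Seattle / New York inventory scenario can be played out with two in-memory replicas and a replication queue. This is a deliberately simplified sketch of eventual consistency: the `Replica` class, the `pending` queue and the stock figures are all assumptions for illustration, and real systems add conflict resolution, retries and ordering guarantees.

```python
class Replica:
    """One server's copy of the stock level for a single item."""

    def __init__(self, name, stock):
        self.name = name
        self.stock = stock

seattle = Replica("Seattle", stock=50)
new_york = Replica("New York", stock=50)
pending = []  # replication queue: updates not yet applied to other replicas

def write(replica, new_stock, others):
    """Apply the write locally and queue it for the other replicas."""
    replica.stock = new_stock
    pending.extend((other, new_stock) for other in others)

def replicate():
    """Deliver queued updates; after this the replicas converge."""
    while pending:
        other, new_stock = pending.pop(0)
        other.stock = new_stock

write(seattle, 0, others=[new_york])   # stock depleted in Seattle
stale_read = new_york.stock            # New York has not heard yet
replicate()
fresh_read = new_york.stock            # replicas now agree

print(stale_read, fresh_read)
```

Between the write and the replication step, an order placed in New York would be accepted against stock that no longer exists. That window is tolerable for catalog data and unacceptable for inventory or balances, which is the trade-off the slide describes.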
    • 77. CAP Theorem
        • [Diagram: Consistency, Availability and Partition Tolerance, with Relational and NoSQL databases positioned among them]
    • 78. Indexing
        • Most NoSQL databases are indexed by key
        • Some allow so-called “secondary” indexes
        • Often the primary key indexes are clustered
        • HBase uses HDFS (the Hadoop Distributed File System), which is append-only
            • Writes are logged
            • Logged writes are batched
            • File is re-created and sorted
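The three steps listed for HBase (log the write, batch it, re-create a sorted file) can be sketched as a tiny append-only store. This is a loose, in-memory illustration of the idea rather than HBase's actual storage engine; `AppendOnlyStore`, its flush threshold and its method names are hypothetical.

```python
import bisect

class AppendOnlyStore:
    """Toy store that never edits a file in place (hypothetical, not HBase code)."""

    def __init__(self, flush_threshold=3):
        self.log = []      # write log: every write is recorded first
        self.buffer = {}   # batched writes held in memory
        self.files = []    # immutable, key-sorted files, newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.log.append((key, value))
        self.buffer[key] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the batch out as a new sorted file; existing files are untouched."""
        if self.buffer:
            self.files.append(sorted(self.buffer.items()))
            self.buffer = {}

    def compact(self):
        """Re-create a single sorted file from all files; newer values win."""
        merged = {}
        for sorted_file in self.files:
            merged.update(sorted_file)
        self.files = [sorted(merged.items())]

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for sorted_file in reversed(self.files):   # newest file wins
            i = bisect.bisect_left(sorted_file, (key,))
            if i < len(sorted_file) and sorted_file[i][0] == key:
                return sorted_file[i][1]
        return None

store = AppendOnlyStore(flush_threshold=2)
store.put("b", 1)
store.put("a", 2)   # second write triggers a flush to a sorted file
store.put("b", 3)   # newer value for "b" sits in the buffer
print(store.get("b"), store.get("a"), store.files)
```

Because each file is sorted by key, a lookup is a binary search, which is what makes the primary key index effectively clustered.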
    • 79. Queries
        • Typically no query language
        • Instead, create procedural program
        • Sometimes SQL is supported
        • Sometimes MapReduce code is used…
    • 80. MapReduce
        • This is not Hadoop's MapReduce, but it's conceptually related
        • Map step: pre-processes data
        • Reduce step: summarizes/aggregates data
        • Will show a MapReduce code sample for Mongo soon
        • Will demo map code on CouchDB
    • 81.
    • 82. Sharding
        • A partitioning pattern where separate servers store partitions
        • Fan-out queries supported
        • Partitions may be duplicated, so replication also provided
            • Good for disaster recovery
        • Since “shards” can be geographically distributed, sharding can act like a CDN
        • Good for keeping data close to processing
            • Reduces network traffic when MapReduce splitting takes place
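A minimal sketch of sharding with a fan-out query follows. The slides do not say how a key is assigned to a partition, so hash-based placement is an assumption here, as are the `ShardedStore` class, the shard names and the sample orders. Replication is left out to keep the example short.

```python
import hashlib

class ShardedStore:
    """Toy key-value store split across named shards (hypothetical)."""

    def __init__(self, shard_names):
        self.names = list(shard_names)
        self.shards = {name: {} for name in self.names}

    def shard_for(self, key):
        """Pick a shard from a stable hash of the key."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.names[int(digest, 16) % len(self.names)]

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        """Single-key reads touch exactly one shard."""
        return self.shards[self.shard_for(key)].get(key)

    def fan_out(self, predicate):
        """Send the same query to every shard and merge the partial results."""
        results = {}
        for shard in self.shards.values():
            results.update({k: v for k, v in shard.items() if predicate(v)})
        return results

store = ShardedStore(["seattle", "new-york", "chicago"])
for order_id, total in [("1501", 300), ("1502", 2500), ("1503", 40)]:
    store.put(order_id, total)

large = store.fan_out(lambda total: total >= 300)
print(store.get("1502"), large)
```

Lookups by key are cheap because only one shard is consulted, while any query that cannot be answered from the key alone has to visit every shard. That cost is one reason NoSQL stores favor key-based access.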
    • 83. NoSQL Categories
        • Graph
        • Wide Column
        • Document
        • Key/Value
    • 84. Key-Value Stores
        • The most common; not necessarily the most popular
        • Has rows, each with something like a big dictionary/associative array
            • Schema may differ from row to row
        • Common on cloud platforms
            • e.g. Amazon SimpleDB, Azure Table Storage
        • MemcacheDB, Voldemort, Couchbase, DynamoDB (AWS), Dynomite, Redis and Riak
    • 85. Key-Value Stores
        • Database
            • Table: Customers
                • Row ID: 101
                    • First_Name: Andrew
                    • Last_Name: Brust
                    • Address: 123 Main Street
                    • Last_Order: 1501
                • Row ID: 202
                    • First_Name: Jane
                    • Last_Name: Doe
                    • Address: 321 Elm Street
                    • Last_Order: 1502
            • Table: Orders
                • Row ID: 1501
                    • Price: 300 USD
                    • Item1: 52134
                    • Item2: 24457
                • Row ID: 1502
                    • Price: 2500 GBP
                    • Item1: 98456
                    • Item2: 59428
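The Customers and Orders tables from slide 85 map directly onto nested dictionaries, which is a reasonable mental model for a key-value store. The data below is taken from the slide; the `last_order` helper and the extra `Twitter` field are hypothetical additions for illustration.

```python
database = {
    "Customers": {
        101: {"First_Name": "Andrew", "Last_Name": "Brust",
              "Address": "123 Main Street", "Last_Order": 1501},
        202: {"First_Name": "Jane", "Last_Name": "Doe",
              "Address": "321 Elm Street", "Last_Order": 1502},
    },
    "Orders": {
        1501: {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
        1502: {"Price": "2500 GBP", "Item1": 98456, "Item2": 59428},
    },
}

def last_order(customer_id):
    """Follow Last_Order by hand: the store itself knows nothing about joins."""
    customer = database["Customers"][customer_id]
    return database["Orders"][customer["Last_Order"]]

# Rows need not share a schema: add a field to one row only (hypothetical field).
database["Customers"][202]["Twitter"] = "@janedoe"

print(last_order(101)["Price"])
```

The lookup in `last_order` is the application doing what a SQL join would do, which is typical of how relationships are handled in this category.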
    • 86. Wide Column Stores
        • Has tables with declared column families
            • Each column family has “columns” which are KV pairs that can vary from row to row
        • These are the most foundational for large sites
            • BigTable (Google)
            • HBase (Originally part of Yahoo-dominated Hadoop project)
            • Cassandra (Facebook)
                • Calls column families “super columns” and tables “super column families”
        • They are the most “Big Data”-ready
            • Especially HBase + Hadoop
    • 87. Wide Column Stores
        • Table: Customers
            • Row ID: 101
                • Super Column: Name
                    • Column: First_Name: Andrew
                    • Column: Last_Name: Brust
                • Super Column: Address
                    • Column: Number: 123
                    • Column: Street: Main Street
                • Super Column: Orders
                    • Column: Last_Order: 1501
            • Row ID: 202
                • Super Column: Name
                    • Column: First_Name: Jane
                    • Column: Last_Name: Doe
                • Super Column: Address
                    • Column: Number: 321
                    • Column: Street: Elm Street
                • Super Column: Orders
                    • Column: Last_Order: 1502
        • Table: Orders
            • Row ID: 1501
                • Super Column: Pricing
                    • Column: Price: 300 USD
                • Super Column: Items
                    • Column: Item1: 52134
                    • Column: Item2: 24457
            • Row ID: 1502
                • Super Column: Pricing
                    • Column: Price: 2500 GBP
                • Super Column: Items
                    • Column: Item1: 98456
                    • Column: Item2: 59428
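Compared with the key-value layout, a wide column store adds one level: columns are grouped into families that are declared up front, while the columns inside a family stay free-form. The sketch below models the Customers table from slide 87 that way. The `put` and `read_family` helpers and the `Zip` column are hypothetical.

```python
DECLARED_FAMILIES = {"Name", "Address", "Orders"}

customers = {
    101: {
        "Name": {"First_Name": "Andrew", "Last_Name": "Brust"},
        "Address": {"Number": 123, "Street": "Main Street"},
        "Orders": {"Last_Order": 1501},
    },
    202: {
        "Name": {"First_Name": "Jane", "Last_Name": "Doe"},
        "Address": {"Number": 321, "Street": "Elm Street"},
        "Orders": {"Last_Order": 1502},
    },
}

def put(table, row_id, family, column, value):
    """Families are fixed by the table definition; columns inside them are not."""
    if family not in DECLARED_FAMILIES:
        raise KeyError(f"column family not declared: {family}")
    table.setdefault(row_id, {}).setdefault(family, {})[column] = value

def read_family(table, family):
    """Read one family across all rows without touching the others."""
    return {row_id: row.get(family, {}) for row_id, row in table.items()}

put(customers, 101, "Address", "Zip", "10014")   # new column on one row only
print(read_family(customers, "Address"))
```

Reading a single family across many rows is the access pattern these stores are built for, which is part of why they suit large analytical scans.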
    • 88. Demo: Wide Column Stores
    • 89. Document Stores
        • Have “databases,” which are akin to tables
        • Have “documents,” akin to rows
            • Documents are typically JSON objects
            • Each document has properties and values
            • Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e. contained JSON objects - Allows for hierarchical storage)
            • Can have attachments as well
        • Old versions are retained
            • So Doc Stores work well for content management
        • Some view doc stores as specialized KV stores
        • Most popular with developers, startups, VCs
        • The biggies:
            • CouchDB
                • Derivatives
            • MongoDB
    • 90. Document Store Application Orientation
        • Documents can each be addressed by URIs
        • CouchDB supports full REST interface
        • Very geared towards JavaScript and JSON
            • Documents are JSON objects
            • CouchDB/MongoDB use JavaScript as native language
        • In CouchDB, “view functions” also have unique URIs and they return HTML
            • So you can build entire applications in the database
    • 91. Document Stores
        • Database: Customers
            • Document ID: 101
                • First_Name: Andrew
                • Last_Name: Brust
                • Address:
                    • Number: 123
                    • Street: Main Street
                • Orders:
                    • Most_recent: 1501
            • Document ID: 202
                • First_Name: Jane
                • Last_Name: Doe
                • Address:
                    • Number: 321
                    • Street: Elm Street
                • Orders:
                    • Most_recent: 1502
        • Database: Orders
            • Document ID: 1501
                • Price: 300 USD
                • Item1: 52134
                • Item2: 24457
            • Document ID: 1502
                • Price: 2500 GBP
                • Item1: 98456
                • Item2: 59428
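Since documents are typically JSON objects, the customer from slide 91 can be written as a nested Python dictionary and round-tripped through JSON. The `_id` field name is an assumption borrowed from common document-store convention; the slide only says "Document ID". No database is involved here, only the shape of the data.

```python
import json

customer = {
    "_id": 101,                                            # assumed field name
    "First_Name": "Andrew",
    "Last_Name": "Brust",
    "Address": {"Number": 123, "Street": "Main Street"},   # sub-document
    "Orders": {"Most_recent": 1501},                        # link into Orders
}
orders = {1501: {"Price": "300 USD", "Item1": 52134, "Item2": 24457}}

serialized = json.dumps(customer, sort_keys=True)   # what would be stored or sent
restored = json.loads(serialized)
most_recent = orders[restored["Orders"]["Most_recent"]]

print(serialized)
print(most_recent["Price"])
```

The address lives inside the customer document instead of in a separate table, which is the hierarchical storage the slides mention. The order is still a link that the application has to follow itself.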
    • 92. Comparing…
    • 93. Demo: Document Stores
    • 94. Graph Databases
        • Great for social network applications and others where relationships are important
        • Nodes and edges
            • Edge like a join
            • Nodes like rows in a table
        • Nodes can also have properties and values
        • Neo4j is a popular graph db
    • 95. Graph Databases
        • [Diagram: a database drawn as a graph]
            • Person nodes: Joe Smith, Jane Doe, Andrew Brust, George Washington
            • Address node: Street: 123 Main Street, City: New York, State: NY, Zip: 10014
            • Item nodes: ID: 52134, Type: Dress, Color: Blue; ID: 24457, Type: Shirt, Color: Red
            • Order node: ID: 252, Total Price: 300 USD
            • Edge labels: Sent invitation to, Commented on photo by, Friend of, Address, Placed order, Item1, Item2
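A graph of nodes with properties and labeled edges can be sketched with a dictionary and a list of triples. The node names and properties come from slide 95, but the transcript does not record which node each edge connects, so the endpoints chosen below are assumptions, as are the helper functions.

```python
nodes = {
    "andrew": {"name": "Andrew Brust"},
    "jane": {"name": "Jane Doe"},
    "joe": {"name": "Joe Smith"},
    "order252": {"total_price": "300 USD"},
    "item52134": {"type": "Dress", "color": "Blue"},
    "item24457": {"type": "Shirt", "color": "Red"},
}

# (source, relationship, target); which node connects to which is assumed.
edges = [
    ("andrew", "FRIEND_OF", "jane"),
    ("joe", "SENT_INVITATION_TO", "andrew"),
    ("andrew", "PLACED_ORDER", "order252"),
    ("order252", "ITEM", "item52134"),
    ("order252", "ITEM", "item24457"),
]

def neighbors(node, relationship):
    """Follow every outgoing edge of one type from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relationship]

def items_ordered_by(person):
    """Two-hop traversal: person -> orders -> items, with no join tables."""
    return [nodes[item]
            for order in neighbors(person, "PLACED_ORDER")
            for item in neighbors(order, "ITEM")]

print(items_ordered_by("andrew"))
```

Each hop plays the role a join would play in a relational query. A graph database indexes the edges so that such traversals stay fast as the number of hops grows; the linear scan here is only for readability.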
    • 96. NoSQL on Windows Azure
        • Platform as a Service
            • Cloudant: https://cloudant.com/azure/
            • MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/
        • MongoDB, DIY:
            • On an Azure Worker Role: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
            • On a Windows VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer
            • On a Linux VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial
              http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/
    • 97. NoSQL on Windows Azure
        • Others, DIY (Linux VMs):
            • Couchbase: http://blog.couchbase.com/couchbase-server-new-windows-azure
            • CouchDB: http://ossonazure.interoperabilitybridges.com/articles/couchdb-installer-for-windows-azure
            • Riak: http://basho.com/blog/technical/2012/10/09/Riak-on-Microsoft-Azure/
            • Redis: http://blogs.msdn.com/b/tconte/archive/2012/06/08/running-redis-on-a-centos-linux-vm-in-windows-azure.aspx
            • Cassandra: http://www.windowsazure.com/en-us/manage/linux/other-resources/how-to-run-cassandra-with-linux/
    • 98. NoSQL + BI
        • NoSQL databases are bad for ad hoc query and data warehousing
        • BI applications involve models; models rely on schema
        • Extract, transform and load (ETL) may be your friend
        • Wide-column stores, however are good for “Big Data”
            • See next slide
        • Wide-column stores and column-oriented databases are similar technologically
    • 99. NoSQL + Big Data
        • Big Data and NoSQL are interrelated
        • Typically, Wide-Column stores used in Big Data scenarios
        • Prime example:
            • HBase and Hadoop
        • Why?
            • Lack of indexing not a problem
            • Consistency not an issue
            • Fast reads very important
            • Distributed file systems important too
            • Commodity hardware and disk assumptions also important
            • Not Web scale but massive scale-out, so similar concerns
    • 100. NoSQL Compromises
        • Eventual consistency
        • Write buffering
        • Only primary keys can be indexed
        • Queries must be written as programs
        • Tooling
            • Productivity (= money)
    • 101. Common DBA Tasks in NoSQL
        • RDBMS -> NoSQL
            • Import Data -> Import Data
            • Setup Security -> Setup Security
            • Perform a Backup -> Make a copy of the data
            • Restore a Database -> Move a copy to a location
            • Create an Index -> Create an Index
            • Join Tables Together -> Run MapReduce
            • Schedule a Job -> Schedule a (Cron) Job
            • Run Database Maintenance -> Monitor space and resources used
            • Send an Email from SQL Server -> Set up resource threshold alerts
            • Search BOL -> Interpret Documentation
    • 102. Which Type of NoSQL for Which Type of Data?
        • Type of Data -> Type of NoSQL solution (Example)
            • Log files -> Wide Column (HBase)
            • Product Catalogs -> Key Value on disk (DynamoDB)
            • User profiles -> Key Value in memory (Redis)
            • Startups -> Document (MongoDB)
            • Social media connections -> Graph (Neo4j)
            • LOB w/Transactions -> NONE! Use RDBMS (SQL Server)
    • 103. Relational vs. NoSQL
        • Line of Business -> Relational
        • Large, public (consumer)-facing sites -> NoSQL
        • Complex data structures -> Relational
        • Big Data -> NoSQL
        • Transactional -> Relational
        • Content Management -> NoSQL
        • Enterprise -> Relational
        • Consumer Web -> NoSQL
    • 104. Data Scientists…
    • 105. NoSQL To-Do List
        • Understand CAP & types of NoSQL databases
            • Use NoSQL when business needs designate
            • Use the right type of NoSQL for your business problem
        • Try out NoSQL on the cloud
            • Quick and cheap for behavioral data
            • Mashup cloud datasets
            • Good for specialized use cases, i.e. dev, test, training environments
        • Learn NoSQL access technologies
            • New query languages, i.e. MapReduce, R, Infer.NET
            • New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc…
    • 106. NoSQL for .NET Developers
        • RavenDB
        • MongoDB C#/.NET Driver
        • MongoDB on Windows Azure
        • CouchBase .NET Client Library
        • Riak client for .NET
        • AWS Toolkit for Visual Studio
        • Google cloud APIs (REST-based)
    • 107. Thank You
        • andrew.brust@bluebadgeinsights.com
        • @andrewbrust on twitter
        • Want to get on Blue Badge Insights' list? Text “bluebadge” to 22828
    • 108. Thank you!
        • Diamond Sponsor
