NoSQL and SQL - Open Analytics Summit


Speaker notes
  • Emphasize previous experience in my applied domain, bioinformatics (BFX), and the difficulty of processing queries effectively (stratified experiments on high-dimensional genomic data).
  • I'm assuming that the typical attendee of this talk is a software developer familiar with and interested in open source technologies, who is already familiar with Hadoop and relational databases, and who has heard of or may have some hands-on experience with NoSQL technologies.
  • Note the correspondences between each offline operation and its online counterpart.
  • Call detail records, as we've been hearing about in the news around PRISM recently.
  • SQL-on-Hadoop approaches: Hive compiles to MapReduce; Aster uses external tables in an MPP database; Oracle/MySQL export MapReduce results into the RDBMS; Drill, Impala and CitusDB target real-time queries.

    1. NoSQL and SQL Work Side-by-Side to Tackle Real-time Big Data Needs
       Allen Day, MapR Technologies
    2. Me
       • Allen Day
         – Principal Data Scientist @ MapR
         – Human Genomics / Bioinformatics (PhD, UCLA School of Medicine)
       • @allenday
    3. You
       • I'm assuming that the typical attendee:
         – is a software developer
         – is interested in and familiar with open source
         – is familiar with Hadoop and relational DBs
         – has heard of or has used some NoSQL technology
    4. Big Data Workloads
       • Offline
         – ETL
         – Model creation, clustering & indexing
         – Web crawling
         – Batch reporting
       • Online
         – Lightweight OLTP
         – Classification & anomaly detection
         – Stream processing
         – Interactive reporting (SQL)
    5. What is NoSQL? Why use it?
       • Traditional storage (relational DBs) is unable to accommodate the increasing number and variety of observations
         – Culprits: sensors, event logs, electronic payments
       • Solution: stay responsive by relaxing ACID storage requirements
         – Denormalize (number)
         – Loosen schema (variety), loosen consistency
       • This is the essence of NoSQL
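The relaxed-schema trade-off on this slide can be sketched in a few lines of Python. The event records and field names below are invented for illustration, not from any specific system:

```python
# A minimal sketch of the NoSQL trade-off: instead of forcing every
# observation into one rigid relational schema, store each event as a
# self-contained (denormalized) document whose fields may vary.
events = [
    {"type": "sensor", "device": "t-101", "celsius": 21.5},
    {"type": "payment", "user": "jane", "amount_usd": 42.00, "card": "visa"},
    {"type": "click", "user": "jane", "url": "/cart"},  # no amount, no device
]

# With a loose schema, ingesting a new kind of observation needs no
# ALTER TABLE -- readers just handle missing fields defensively.
def total_payments(records):
    return sum(r.get("amount_usd", 0) for r in records if r["type"] == "payment")

print(total_payments(events))  # 42.0
```

The cost, as the next slides note, is that downstream consumers can no longer assume one fixed schema or strict consistency.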
    6. NoSQL Impact on Business Processes
       • The traditional business intelligence (BI) tech stack assumes relational DB storage
         – Company decisions depend on this (reports, charts)
       • NoSQL-collected data aren't in the relational DB
         – Data volume/variety is still increasing
         – Tech and methods are still in flux
       • Decoupled data storage and decision support systems
         – BI can't access the freshest, largest data sets
         – Very high opportunity cost to the business
    7. Ideal Solution Features
       • Scalable & reliable (Hadoop FS; MapReduce, YARN)
         – Distributed replicated storage
         – Distributed parallel processing
       • BI application support (SQL interface)
         – Ad-hoc, interactive queries
         – Real-time responsiveness
       • Flexible (extensible for NoSQL, advanced analytics)
         – Handles rapid storage and schema evolution
         – Handles new analytics methods and functions
    8. From Ideals to Possibilities
       • Migrate NoSQL data/processing to SQL
         – High cost to marshal NoSQL data into SQL storage
         – SQL systems lack advanced analytics capabilities
       • Migrate SQL data to NoSQL
         – Breaks compatibility for BI-dependent functions and reporting
         – Limited support for relational operations (joins): high latency
         – NoSQL tech is still in flux (continuity)
       • Other approaches?
         – Yes. First, let's consider a SQL/NoSQL use case
    9. Impala: Interactive Queries & Hadoop (low latency)
    10. Example Problem: Marketing Campaign
        • Jane is an analyst at an e-commerce company
        • How does she figure out good targeting segments for the next marketing campaign?
        • She has some ideas... and lots of data: user profiles, transaction information, access logs
    11. Traditional System Solution 1: RDBMS
        • ETL the data from MongoDB and Hadoop into the RDBMS
          – MongoDB data must be flattened, schematized, filtered and aggregated
          – Hadoop data must be filtered and aggregated
        • Query the data using any SQL-based tool
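The "flattened and schematized" step is where much of this ETL pain lives. A minimal sketch of such a flattener, assuming invented MongoDB-style documents (this is not any real connector API):

```python
def flatten(doc, prefix=""):
    """Collapse nested dicts into dotted column names. Lists are the
    hard part: each list element becomes its own output row, which is
    exactly the row multiplication that makes flattening lossy."""
    rows = [{}]
    for key, value in doc.items():
        col = f"{prefix}{key}"
        if isinstance(value, dict):
            sub = flatten(value, col + ".")
            rows = [dict(r, **s) for r in rows for s in sub]
        elif isinstance(value, list):
            sub = [flatten(v, col + ".")[0] if isinstance(v, dict)
                   else {col: v} for v in value]
            rows = [dict(r, **s) for r in rows for s in sub]
        else:
            rows = [dict(r, **{col: value}) for r in rows]
    return rows

profile = {"user": "jane", "address": {"city": "LA"},
           "carts": [{"id": 1}, {"id": 2}]}
for row in flatten(profile):
    print(row)  # one nested profile becomes two flat rows
```

Every such transformation has to be maintained by hand as the source documents evolve, which is the cost the slide is pointing at.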
    12. Traditional System Solution 2: Hadoop
        • ETL the data from Oracle and MongoDB into Hadoop
          – MongoDB data must be flattened and schematized
        • Work with the MapReduce team to write custom code to generate the desired analyses
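"Custom code" here means hand-writing map and reduce functions for every new question Jane asks. A toy, in-process imitation of the paradigm, aggregating spend per user (field names invented):

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # emit (key, value) pairs
    yield record["user"], record["amount"]

def reducer(key, values):
    yield key, sum(values)

def run_job(records):
    pairs = [kv for r in records for kv in mapper(r)]
    pairs.sort(key=itemgetter(0))  # stands in for the shuffle phase
    return {k: v for key, group in groupby(pairs, key=itemgetter(0))
            for k, v in reducer(key, (val for _, val in group))}

txns = [{"user": "jane", "amount": 10}, {"user": "bob", "amount": 5},
        {"user": "jane", "amount": 7}]
print(run_job(txns))  # {'bob': 5, 'jane': 17}
```

Each new analysis means another mapper/reducer pair plus a batch job, which is why the slide frames this as working "with the MapReduce team" rather than self-service.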
    13. Traditional System Solution 3: Hive
        • ETL the data from Oracle and MongoDB into Hadoop
          – MongoDB data must be flattened and schematized
        • But HiveQL queries are slow and BI tool support is limited
          – Marshaling/coding
    14. What Would Google Do?
        • Google: GFS (distributed file system), BigTable (NoSQL), Dremel (interactive analysis), MapReduce (batch processing)
        • Hadoop: HDFS, HBase, ???, Hadoop MapReduce
        • Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
    15. Apache Drill Overview
        • Interactive analysis of Big Data using standard SQL
        • Fast
          – Low-latency queries
          – Complements native interfaces and MapReduce/Hive/Pig
        • Open
          – Community-driven open source project
          – Under the Apache Software Foundation
        • Modern
          – Standard ANSI SQL:2003 (select/into)
          – Nested data support
          – Schema is optional
          – Supports RDBMS, Hadoop and NoSQL
        (Diagram: interactive queries for data analysts and reporting, 100 ms to 20 min, go to Apache Drill; data mining, modeling and large ETL, 20 min to 20 hr, go to MapReduce, Hive and Pig.)
    16. How Does It Work?
        • A SQL query arrives at a Drillbit acting as coordinator (SQL query parser, query planner), which distributes work to Drillbit executors
        • Example query:
          SELECT * FROM oracle.transactions, mongo.users, hdfs.events LIMIT 1
        • Drill clients (Tableau, MicroStrategy, Crystal Reports) connect through the Drill ODBC driver
    17. How Does It Work?
        • Drillbits run on each node, designed to maximize data locality
        • Processing is done outside the MapReduce paradigm (but possibly within YARN)
        • Queries can be fed to any Drillbit
        • Coordination, query planning, optimization, scheduling, and execution are distributed
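The coordinator/executor flow these two slides describe can be illustrated with a toy simulation. This is a sketch of the idea only, not Drill's actual code; the sources mirror the example query's hypothetical tables:

```python
from concurrent.futures import ThreadPoolExecutor

DATA_NODES = {  # pretend each source lives on a different node
    "oracle.transactions": [{"txn": 1}, {"txn": 2}],
    "mongo.users": [{"user": "jane"}],
    "hdfs.events": [{"event": "click"}],
}

def execute_fragment(table):
    # an executor Drillbit scans the source local to it
    return table, DATA_NODES[table]

def drillbit_query(tables, limit=None):
    # the receiving Drillbit acts as coordinator: plan, fan out, merge
    with ThreadPoolExecutor(max_workers=len(tables)) as pool:
        results = dict(pool.map(execute_fragment, tables))
    merged = [row for table in tables for row in results[table]]
    return merged[:limit] if limit else merged

rows = drillbit_query(["oracle.transactions", "mongo.users", "hdfs.events"],
                      limit=1)
print(rows)  # [{'txn': 1}]
```

The point of the real architecture is that any node can play the coordinator role, so there is no single query gateway to bottleneck on.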
    18. Apache Drill: Key Features
        • Full ANSI SQL:2003 support
          – Use any SQL-based tool
        • Nested data support
          – Flattening is error-prone and often impossible
        • Schema-less data source support
          – Schema can change rapidly and may be record-specific
        • Extensible
          – DSLs, UDFs
          – Custom operators (e.g. k-means clustering)
          – Well-documented data source & file format APIs
    19. How Does Impala Fit In?
        • Impala strengths
          – Beta currently available
          – Easy install and setup on top of Cloudera
          – Faster than Hive on some queries
          – SQL-like query language
        • Questions
          – Open source 'lite'
          – Lacks RDBMS support
          – Lacks NoSQL support beyond HBase
          – Early row materialization increases footprint and reduces performance
          – Limited file format support
          – Query results must fit in memory!
          – A rigid schema is required
          – No support for nested data
          – SQL-like (not SQL)
        • Many important features are "coming soon". The architectural foundation is constrained. No community development.
    20. Drill Status: Alpha Available in July
        • Heavy active development by multiple organizations
          – Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho
        • Available
          – Logical plan syntax and interpreter
          – Reference interpreter
        • In progress
          – SQL interpreter
          – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats
        • Significant community momentum
          – Over 200 people on the Drill mailing list
          – Over 200 members of the Bay Area Drill User Group
          – Drill meetups across the US and Europe
        • Beta: Q3
    21. Why Apache Drill Will Be Successful
        • Resources
          – Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho
        • Community
          – Development done in the open
          – Active contributors from multiple companies
          – Rapidly growing
        • Architecture
          – Full SQL
          – New data support
          – Extensible APIs
          – Full columnar execution
          – Beyond Hadoop
        • Bottom line: Apache Drill enables NoSQL and SQL to work side-by-side to tackle real-time Big Data needs
    22. Me
        • Allen Day
          – Principal Data Scientist @ MapR
        • @allenday
    24. Full SQL (ANSI SQL:2003)
        • Drill supports SQL (ANSI SQL:2003 standard)
          – Correlated subqueries, analytic functions, ...
          – SQL-like is not enough
        • Use any SQL-based tool with Apache Drill
          – Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, ...
          – Standard ODBC and JDBC drivers
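To make "SQL-like is not enough" concrete: a correlated subquery is the kind of standard construct BI tools generate that a SQL-like dialect may reject. SQLite is used below purely as a stand-in engine to show the query shape; the tables are invented:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users(name TEXT);
    CREATE TABLE orders(user_name TEXT, amount REAL);
    INSERT INTO users VALUES ('jane'), ('bob');
    INSERT INTO orders VALUES ('jane', 10), ('jane', 7), ('bob', 5);
""")

# correlated subquery: the inner SELECT references the outer row (u.name)
rows = db.execute("""
    SELECT u.name,
           (SELECT SUM(o.amount) FROM orders o
             WHERE o.user_name = u.name) AS total
      FROM users u
     ORDER BY u.name
""").fetchall()
print(rows)  # [('bob', 5.0), ('jane', 17.0)]
```

A tool like Tableau emits this sort of SQL automatically, which is why full ANSI support matters more than a familiar-looking syntax.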
    25. Nested Data
        • Nested data is becoming prevalent
          – JSON, BSON, XML, Protocol Buffers, Avro, etc.
          – The data source may or may not be aware
            • MongoDB supports nested data natively
            • A single HBase value could be a JSON document (compound nested type)
          – Google Dremel's innovation was efficient columnar storage and querying of nested data
        • Flattening nested data is error-prone and often impossible
          – Think about repeated and optional fields at every level...
        • Apache Drill supports nested data
          – Extensions to ANSI SQL:2003
        • Example Avro schema and JSON record:
          Avro:
            enum Gender { MALE, FEMALE }
            record User { string name; Gender gender; long followers; }
          JSON:
            {"name": "Homer", "gender": "Male", "followers": 100,
             "children": [{"name": "Bart"}, {"name": "Lisa"}]}
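The "repeated and optional fields at every level" problem can be shown with the slide's own Homer record, extended with an invented second repeated field (emails) for illustration:

```python
homer = {"name": "Homer", "followers": 100,
         "children": [{"name": "Bart"}, {"name": "Lisa"}],
         "emails": ["homer@example.com", "hs@example.com"]}  # invented field

# Flattening cross-joins the two independent repeated fields,
# fabricating child/email pairings that never co-occurred.
flat = [(c["name"], e) for c in homer["children"] for e in homer["emails"]]
print(len(flat))  # 4 rows from 2 children x 2 emails

# Querying the nested record directly keeps the structure intact.
child_names = [c["name"] for c in homer["children"]]
print(child_names)  # ['Bart', 'Lisa']
```

With more levels of nesting the flattened row count multiplies further, and optional fields force NULL-handling choices at every level, which is why the deck calls flattening "often impossible".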
    26. Schema is Optional
        • Many data sources do not have rigid schemas
          – Schemas change rapidly
          – Each record may have a different schema, and may be sparse/wide
        • Apache Drill supports querying against unknown schemas
          – Query any HBase, Cassandra or MongoDB table
        • The user can define the schema or let the system discover it automatically
          – The system of record may already have schema information
          – No need to manage schema evolution
        • Example HBase-style table:
          Row Key            | CF contents                | CF anchor
          "com.cnn.www"      | contents:html = "<html>…"  | … = "", … = "CNN"
          "com.foxnews.www"  | contents:html = "<html>…"  | … = "Fox News"
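What "let the system discover it automatically" might amount to can be sketched as a pass over schema-less records that unions the field names, infers types, and marks fields absent from some records as optional. This is an illustration of the idea, not Drill's discovery mechanism:

```python
def discover_schema(records):
    seen = {}
    for r in records:
        for field, value in r.items():
            info = seen.setdefault(field, {"types": set(), "count": 0})
            info["types"].add(type(value).__name__)
            info["count"] += 1
    total = len(records)
    return {f: {"types": sorted(info["types"]),
                "optional": info["count"] < total}  # missing somewhere
            for f, info in seen.items()}

records = [{"name": "Homer", "followers": 100},
           {"name": "Bart"},                       # sparse record
           {"name": "Lisa", "followers": "many"}]  # type drift
print(discover_schema(records))
```

Note how discovery surfaces both sparseness (followers is optional) and type drift (int vs str), exactly the record-specific schema variation the slide describes.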
    27. Flexible and Extensible Architecture
        • Apache Drill is designed for extensibility
        • Well-documented APIs and interfaces
        • Data sources and file formats
          – Implement a custom scanner to support a new source/format
        • Query languages
          – SQL:2003 is the primary language
          – Implement a custom parser to support a Domain-Specific Language
          – UDFs
        • Optimizers
          – Drill will have a cost-based optimizer
          – Clear surrounding APIs support easy optimizer exploration
        • Operators
          – Custom operators can be implemented (e.g. k-means clustering)
          – Operator push-down to the data source (RDBMS)
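As a sense of what a k-means custom operator would compute, here is a tiny one-dimensional Lloyd's-iteration sketch. It is independent of Drill's operator API (not shown here); the spend values are invented:

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    recomputing each center as its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

spend = [1.0, 1.2, 0.8, 9.5, 10.0, 10.5]  # two obvious segments
print(kmeans_1d(spend, centers=[0.0, 5.0]))  # converges near [1.0, 10.0]
```

Running such an operator inside the engine, next to the data, is the payoff: the segmentation from the marketing example could be computed in the same query that scans the raw records.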