20130617 NoSQL and SQL Work Side-by-Side to Tackle Real-time Big Data Needs - New York - Open Analytics Summit
Big Data use cases over Hadoop have slowly evolved from batch analytics that rely on long-running MapReduce jobs to include more real-time analytics that rely on quick data retrieval through technologies such as NoSQL and interactive SQL. With so many solutions available in the market, how should one go about deciding which technologies to pick? This session helps the audience answer that question.

The session discusses how enterprise architects can take a holistic approach to real-time Big Data problems and look at NoSQL and SQL technologies as complementary pieces of the same puzzle. The session will delve into SQL queries over Hadoop and the various ways Hadoop NoSQL databases and SQL work hand in hand to provide the best real-time results for the business user. Specifically, technologies such as Apache HBase and Apache Drill will be discussed.

Speaker Notes

  • Emphasize previous experience in my applied domain, bioinformatics (BFX): the difficulty of processing queries effectively over stratified experiments on high-dimensional genomic data.
  • I'm assuming that the typical attendee of this talk is a software developer who is familiar with and interested in open source technologies, is already familiar with Hadoop and relational databases, and has heard of or has some hands-on experience with NoSQL technologies.
  • If you can solve the interactive analysis case, you've also solved (possibly sub-optimally) the batch reporting case.
  • Hive: compiles to MapReduce. Aster: external tables in MPP. Oracle/MySQL: export MapReduce results to the RDBMS. Drill, Impala, CitusDB: real-time.

Presentation Transcript

  • NoSQL and SQL Work Side-by-Side to Tackle Real-time Big Data Needs Allen Day MapR Technologies
  • Me • Allen Day – Principal Data Scientist @ MapR – Human Genomics / Bioinformatics (PhD, UCLA School of Medicine) • @allenday • allenday@allenday.com • aday@maprtech.com
  • You • I'm assuming that the typical attendee: – is a software developer – is interested in and familiar with open source – is familiar with Hadoop and relational DBs – has heard of or has used some NoSQL technology
  • Big Data Workloads • ETL • Key-value store • Lightweight OLTP • Model creation & clustering & indexing • Classification & Anomaly detection • Web Crawling • Stream processing • Batch reporting • Interactive analysis
  • What is NoSQL? Why use it? • Traditional storage (relational DBs) is unable to accommodate the increasing number of observations – Culprits: sensors, event logs, electronic payments • Solution: stay responsive by relaxing storage requirements – Denormalize, loosen schema, loosen consistency (see the sketch below) • This is the essence of NoSQL
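To make the denormalization trade-off concrete, here is a minimal sketch in SQL, assuming a hypothetical clickstream schema (all table and column names are invented for illustration):

    -- Normalized relational design: user attributes live in one table
    -- and are joined to events at query time.
    CREATE TABLE users  (user_id INT PRIMARY KEY,
                         name    VARCHAR(64));
    CREATE TABLE events (event_id INT PRIMARY KEY,
                         user_id  INT REFERENCES users(user_id),
                         url      VARCHAR(256),
                         ts       TIMESTAMP);

    -- Denormalized design in the NoSQL spirit: user attributes are
    -- repeated on every event row, so a single read answers the
    -- question with no join, at the cost of extra storage and
    -- update anomalies.
    CREATE TABLE events_denorm (event_id  INT,
                                user_id   INT,
                                user_name VARCHAR(64),  -- repeated per event
                                url       VARCHAR(256),
                                ts        TIMESTAMP);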
  • NoSQL Impact on Business Processes • Traditional business intelligence technology assumes relational DB storage – Scaling solution is to use MPP (Aster, Greenplum) • However, the collected data aren't in a relational DB – Data volume is still increasing – Technology still in flux • Decoupled data storage and decision support systems – Very high opportunity cost to the business
  • Ideal Solution Features • Scalable & Reliable – Distributed storage – Parallel processing • SQL application support – Ad-hoc, interactive queries – Real-time responsiveness • Flexible – Can accommodate rapid storage and schema evolution – Can accommodate new analytics methods and functions
  • From Ideals to Possibilities • Migrate NoSQL data/processing to SQL – High cost to marshal NoSQL data to SQL storage – SQL systems lack advanced analytics capabilities • Migrate SQL data to NoSQL – Breaks compatibility for legacy business functions, e.g. financial reporting requirements – Limited relational support (joins) & high latency – Technology still in flux • Other Approaches? – Yes. First let’s consider a SQL/NoSQL use case
  • Interactive Queries & Hadoop: low latency (Impala)
  • Example Problem: Marketing Campaign • Jane is an analyst at an e-commerce company • How does she figure out good targeting segments for the next marketing campaign? • She has some ideas… • …and lots of data: user profiles, transaction information, access logs
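The segment Jane wants might be expressed as a single query like the sketch below; every name in it is hypothetical, and the point is that the answer spans all three data sources at once. Each "traditional" solution on the following slides is a different way of forcing this one logical question into a single storage system.

    -- Hypothetical segment: registered users who browsed product pages
    -- in the past week but made no purchase (all names illustrative).
    SELECT u.user_id, u.email
    FROM   users u                      -- user profiles (MongoDB)
    JOIN   access_logs a                -- access logs (Hadoop)
           ON a.user_id = u.user_id
    LEFT JOIN transactions t            -- transaction info (Oracle RDBMS)
           ON  t.user_id = u.user_id
           AND t.order_ts >= DATE '2013-06-10'
    WHERE  a.url LIKE '%/product/%'
      AND  a.access_ts >= DATE '2013-06-10'
      AND  t.order_id IS NULL;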
  • Traditional System Solution 1: RDBMS • ETL the data from MongoDB and Hadoop into the RDBMS – MongoDB data must be flattened, schematized, filtered and aggregated – Hadoop data must be filtered and aggregated • Query the data using any SQL-based tool (Data sources: user profiles, access logs, transaction information)
  • Traditional System Solution 2: Hadoop • ETL the data from Oracle and MongoDB into Hadoop • Work with the MapReduce team to write custom code to generate the desired analyses (Data sources: user profiles, access logs, transaction information)
  • Traditional System Solution 3: Hive • ETL the data from Oracle and MongoDB into Hadoop – MongoDB data must be flattened and schematized • But HiveQL is limited, queries are slow and BI tool support is limited (Data sources: user profiles, access logs, transaction information)
  • What Would Google Do?

        Capability                Google      Open source
        Distributed file system   GFS         HDFS
        NoSQL                     BigTable    HBase
        Interactive analysis      Dremel      ???
        Batch processing          MapReduce   Hadoop MapReduce

    Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
  • Apache Drill Overview • Interactive analysis of Big Data using standard SQL • Fast – Low-latency queries – Columnar execution • Inspired by Google Dremel/BigQuery – Complements native interfaces and MapReduce/Hive/Pig • Open – Community-driven open source project – Under the Apache Software Foundation • Modern – Standard ANSI SQL:2003 (select/into) – Nested/hierarchical data support – Schema is optional – Supports RDBMS, Hadoop and NoSQL
    (Diagram: Apache Drill serves interactive queries, reporting and data analysis in the 100 ms–20 min range; MapReduce, Hive and Pig serve data mining, modeling and large ETL in the 20 min–20 hr range.)
  • How Does It Work? • Drillbits run on each node, designed to maximize data locality • Processing is done outside the MapReduce paradigm (but possibly within YARN) • Queries can be fed to any Drillbit • Coordination, query planning, optimization, scheduling, and execution are distributed • Example:

        SELECT * FROM oracle.transactions, mongo.users, hdfs.events LIMIT 1
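Extending the slide's one-line example, a federated version of Jane's campaign query might look like the sketch below. The oracle, mongo and hdfs prefixes name the backing stores exactly as in the slide's example; all other identifiers are hypothetical.

    -- One logical query spanning three live systems, with no ETL step.
    SELECT u.user_id, u.email
    FROM   mongo.users u
    JOIN   hdfs.events a
           ON a.user_id = u.user_id
    LEFT JOIN oracle.transactions t
           ON t.user_id = u.user_id
    WHERE  a.url LIKE '%/product/%'
      AND  t.order_id IS NULL;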
  • Key Features • Full SQL (ANSI SQL:2003) • Nested data • Schema is optional • Flexible and extensible architecture
  • Full SQL (ANSI SQL:2003) • Drill supports SQL (the ANSI SQL:2003 standard) – Correlated subqueries, analytic functions, … – SQL-like is not enough • Use any SQL-based tool with Apache Drill – Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, … – Standard ODBC and JDBC drivers
    (Diagram: client tools such as Tableau, MicroStrategy, Excel and SAP Crystal Reports connect through the Drill ODBC driver to a Drillbit, whose SQL query parser and query planner distribute work across Drillbits.)
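As one illustration of why "SQL-like is not enough", the statement below combines an analytic (window) function with a correlated subquery, both part of ANSI SQL:2003 and both missing from the SQL-like dialects of the day. The orders table is hypothetical.

    -- Rank each order within its region, keeping only orders above the
    -- regional average: needs window functions plus a correlated subquery.
    SELECT region,
           order_id,
           order_total,
           RANK() OVER (PARTITION BY region
                        ORDER BY order_total DESC) AS region_rank
    FROM   orders o
    WHERE  order_total > (SELECT AVG(order_total)
                          FROM   orders i
                          WHERE  i.region = o.region);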
  • Nested Data • Nested data is becoming prevalent – JSON, BSON, XML, Protocol Buffers, Avro, etc. – The data source may or may not be aware • MongoDB supports nested data natively • A single HBase value could be a JSON document (compound nested type) – Google Dremel's innovation was efficient columnar storage and querying of nested data • Flattening nested data is error-prone and often impossible – Think about repeated and optional fields at every level… • Apache Drill supports nested data – Extensions to ANSI SQL:2003 • Example (Avro schema and a JSON document):

        Avro:
          enum Gender { MALE, FEMALE }
          record User {
            string name;
            Gender gender;
            long   followers;
          }

        JSON:
          {
            "name": "Homer",
            "gender": "Male",
            "followers": 100,
            "children": [ { "name": "Bart" }, { "name": "Lisa" } ]
          }
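To see what querying nested data directly could look like, here is an illustrative query over the JSON document above. The path syntax only sketches the flavor of Drill's proposed extensions to ANSI SQL:2003; the exact syntax was still evolving when this talk was given.

    -- Illustrative only: reach into the nested "children" array
    -- without flattening the document first.
    SELECT u.name,
           u.children[0].name AS first_child
    FROM   users u
    WHERE  u.followers > 50;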
  • Schema is Optional • Many data sources do not have rigid schemas – Schemas change rapidly – Each record may have a different schema • Sparse and wide rows in HBase and Cassandra, MongoDB • Apache Drill supports querying against unknown schemas – Query any HBase, Cassandra or MongoDB table • User can define the schema or let the system discover it automatically – The system of record may already have schema information • Why manage it in a separate system? – No need to manage schema evolution • Example (HBase webtable):

        Row Key             CF contents                 CF anchor
        "com.cnn.www"       contents:html = "<html>…"   anchor:my.look.ca = "CNN.com"
                                                        anchor:cnnsi.com = "CNN"
        "com.foxnews.www"   contents:html = "<html>…"   anchor:en.wikipedia.org = "Fox News"
        …                   …                           …
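Against the webtable above, a schema-free query could look like the sketch below, with columns discovered at read time rather than declared up front. The hbase.webtable name and the column-family path syntax are illustrative, not a fixed Drill API.

    -- Illustrative only: address column families and qualifiers by path.
    SELECT row_key,
           t.anchor.`my.look.ca` AS anchor_text
    FROM   hbase.webtable t
    WHERE  t.contents.html LIKE '%CNN%';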
  • Flexible and Extensible Architecture • Apache Drill is designed for extensibility • Well-documented APIs and interfaces • Data sources and file formats – Implement a custom scanner to support a new data source or file format • Query languages – SQL:2003 is the primary language – Implement a custom Parser to support a Domain Specific Language – UDFs and UDTFs • Optimizers – Drill will have a cost-based optimizer – Clear surrounding APIs support easy optimizer exploration • Operators – Custom operators can be implemented • Special operators for Mahout (k-means) being designed – Operator push-down to data source (RDBMS)
  • How Does Impala Fit In? Impala Strengths • Beta currently available • Easy install and setup on top of Cloudera • Faster than Hive on some queries • SQL-like query language Questions • Open Source ‘Lite’ • Doesn’t support RDBMS or other NoSQLs (beyond Hadoop/HBase) • Early row materialization increases footprint and reduces performance • Limited file format support • Query results must fit in memory! • Rigid schema is required • No support for nested data • Compound APIs restrict optimizer progression • SQL-like (not SQL) Many important features are “coming soon”. Architectural foundation is constrained. No community development.
  • Drill Status: Alpha Available Q2 • Heavy active development by multiple organizations – Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho • Available – Logical plan syntax and interpreter – Reference interpreter • In progress – SQL interpreter – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats • Significant community momentum – Over 200 people on the Drill mailing list – Over 200 members of the Bay Area Drill User Group – Drill meetups across the US and Europe – OpenDremel team joined Apache Drill • Anticipated schedule: – Beta: Q3
  • Why Apache Drill Will Be Successful Resources • Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho Community • Development done in the open • Active contributors from multiple companies • Rapidly growing Architecture • Full SQL • New data support • Extensible APIs • Full Columnar Execution • Beyond Hadoop
  • Closing Thoughts • What problems can NoSQL and Drill solve for you? • Where do they fit into your organization? • Which data sources and BI tools are important to you?
  • Me • Allen Day – Principal Data Scientist @ MapR – Human Genomics / Bioinformatics (PhD, UCLA School of Medicine) • @allenday • allenday@allenday.com • aday@maprtech.com