Real-Time Big Data Applications
A Reference Architecture for Search, Discovery, and Analytics

Justin Makeig
Director, Product Management, MarkLogic
June 13, 2012
Hello, my name is _________

§    Director, Product Management
§    Focus on APIs, integrations, and tools
§    With MarkLogic since 2007
§    Former web dev, quant
Agenda

§    Characterizing Big Data applications
§    Examples today
§    Combining analytical and operational
§    What’s next?
Who is MarkLogic?

§  300 customers, $85 million+ in revenue
§  300 employees in San Francisco, New York,
    London, Tokyo, Austin, Frankfurt, Stockholm
§  Founded in 2003
§  Funded by Sequoia and Tenaya
§  Focus on Media, Government, Financial Services
Big Data Workloads

  Analytic           Operational

  §  Batch          §    Real-time, interactive
  §  Aggregate      §    Highly selective
  §  Repeatable     §    Available
                     §    Secure
Operational Databases

RDBMS                       “NoSQL”
§  Indexes                 §  Flexible data model
§  Transactions            §  Commodity scale out
§  Security                §  Distributed, fault-tolerant
§  Enterprise operations   §  Hadoop sink/source


 What if you could get all of these in one system?
MarkLogic Server

§    Enterprise NoSQL database
§    Flexible data model
§    Scales on commodity hardware (1–1,000 nodes)
§    Rich built-in indexes, including full-text, scalar, geo
§    ACID transactions
§    Enterprise-grade operations
Operational Big Data
LexisNexis

§  $4.2 billion in revenue,
    $2.6 billion LOB
§  5 billion+ documents,
    millions of updates/day
§  Real-time search,
    discovery, analytics
§  From 9–12 months to
    2 weeks for new products
§  Enterprise HA/DR
Top 5 Global Investment Bank

§  Real-time transparency
    across all derivatives
§  Predictable scalability
§  Simplified architecture,
    operations
§  Mission-critical uptime and
    performance
                                  http://www.flickr.com/photos/tenaciousme/1797368175/
US Government Intel Agency

§  Crawl of substantial
    part of the Web
§  Evolving enrichment
§  Real-time analysis
§  Granular security
§  Centralized governance
§  ½ DBA
                             http://www.flickr.com/photos/usarak/4969182481
Big Data Applications
Unified Data

§  Flexible data model reduces need for ETL
§  Multiple simultaneous applications
§  Single governance model
Enterprise Operations

§    Predictable scalability
§    Replication and failover
§    Backup and recovery
§    Instrumentation and monitoring
Continuous Adaptation

§  Load data as-is, evolve with requirements
§  Add new sources in days, not months
§  Transactional updates for accuracy
Iterative Query

§  Real-time access
§  Multi-faceted queries
   –  Full text
   –  Structure, semantics, and relationships
   –  Scalar values and ranges
      (date/time, numbers, strings)
   –  Geospatial
§  Alerting
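
A minimal Python sketch of what a multi-faceted query combines, assuming toy in-memory documents. The field names (`text`, `date`, `lat`, `lon`) and the function `multi_facet_search` are hypothetical and are not MarkLogic's API; a real system would resolve each clause from an index rather than by scanning.

```python
from datetime import date

# Toy documents; the field names are illustrative assumptions.
DOCS = [
    {"id": 1, "text": "interest rate swap confirmed", "date": date(2012, 6, 1), "lat": 40.7, "lon": -74.0},
    {"id": 2, "text": "credit default swap pending", "date": date(2012, 3, 15), "lat": 51.5, "lon": -0.1},
    {"id": 3, "text": "equity option confirmed", "date": date(2012, 6, 10), "lat": 40.8, "lon": -73.9},
]

def multi_facet_search(docs, term, start, end, box):
    """Return ids of docs that match a word, a date range, and a bounding box."""
    south, west, north, east = box
    hits = []
    for d in docs:
        has_term = term in d["text"].split()                  # full-text clause
        in_range = start <= d["date"] <= end                  # scalar range clause
        in_box = south <= d["lat"] <= north and west <= d["lon"] <= east  # geospatial clause
        if has_term and in_range and in_box:
            hits.append(d["id"])
    return hits

# "swap" documents dated May-June 2012 inside a box around New York.
result = multi_facet_search(
    DOCS, "swap", date(2012, 5, 1), date(2012, 6, 30), (40.0, -75.0, 41.0, -73.0)
)
```

All three clauses are ANDed together, so each one narrows the result set of the others.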
Big Data Application Platform

[Diagram: platform layers, top to bottom]
  APIs and tools
  Visualization | Data Mining | Event Processing | Metadata | Search
  Operational Environment: Analytic DB and EDW | Operational DB | Unstructured Content
  Hadoop: Acquisition, Batch Analytics, and Enrichment | Archive
In practice…
[Diagram: components of a typical deployment]
  BI Tools | Applications
  Stats (SPSS, SAS, R, …) | Stream and Event Processing | Search | Search Index | Metadata
  Analytic DB / EDW | Operational DB | Unstructured Content Store
  Batch Analytics (Hadoop MR) | Archive (HDFS)
Simplified Architecture
[Diagram: components of a typical deployment]
  BI Tools | Applications
  Stats (SPSS, SAS, R, …) | Stream and Event Processing | Search | Search Index | Metadata
  Analytic DB / EDW | Operational DB | Unstructured Content Store
  Batch Analytics (Hadoop MR) | Archive (HDFS)
Simplified Architecture
[Diagram: components remaining after simplification]
  BI Tools | Applications
  Stats (SPSS, SAS, R, …) | Metadata
  Analytic DB / EDW
  Batch Analytics (Hadoop MR) | Archive (HDFS)
Combining Analytic and Operational
Use Cases

[Diagram: Raw Data (?), Hadoop, MarkLogic + Connector for Hadoop, and Operational Applications, connected by three numbered use cases]
  1  Intermediate Intelligence
  2  Progressive Enhancement
  3  Archive
Intermediate Intelligence
Sophisticated pre-processing for real-time analytics
§  Aggregate, transform, enrich, join, restructure
§  Keep everything: Long-tail, cost-effective warm
    storage in HDFS
§  Leverage MapReduce ecosystem for analysis and
    ETL and refinement
Progressive Enhancement
Enhance data incrementally to answer new questions
§  Enrich data for search, analytics, and delivery
§  Leverage MarkLogic indexes for performance,
    accuracy
§  Leverage the growing Hadoop/Java ecosystem
    for processing
§  Centralized governance, security in MarkLogic
Archive
Age out data to another storage tier
§  Align storage and processing resources with the
    value of data
§  Maintain a complete picture of all data
§  Simplified lifecycle management for compliance
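
A small Python sketch of an age-out rule, assuming each document carries a last-modified date. The names `split_by_age`, `uri`, and `modified` are hypothetical, and the 365-day threshold is only an example; the slide does not specify how data is selected for archiving.

```python
from datetime import date, timedelta

def split_by_age(docs, today, max_age_days):
    """Split docs into (keep, archive) by last-modified date."""
    cutoff = today - timedelta(days=max_age_days)
    keep = [d for d in docs if d["modified"] >= cutoff]
    archive = [d for d in docs if d["modified"] < cutoff]
    return keep, archive

docs = [
    {"uri": "/a", "modified": date(2012, 6, 1)},
    {"uri": "/b", "modified": date(2011, 1, 1)},
]
# Documents older than one year move to the cheaper tier (for example, HDFS).
keep, archive = split_by_age(docs, date(2012, 6, 13), max_age_days=365)
```

Nothing is deleted: every document ends up in exactly one tier, which is what keeps the complete picture of the data intact.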
Reading Data from MarkLogic
Query for input, read in parallel directly from partitions
§  Specify input with a query or expression
§  Automatically divide up input for parallel Map
§  Each split covers one partition


[Diagram: docs 01–10, 11–18, 19–30, and 31–37 held in four partitions across Host 1 and Host 2]
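
The split-per-partition idea can be sketched in Python as follows. This is not the connector's API: the partition names, `make_splits`, `map_task`, and `run_job` are hypothetical, and the sketch assumes each document range in the diagram is one partition.

```python
from concurrent.futures import ThreadPoolExecutor

# Document ranges from the slide's diagram, assumed to be one partition each.
PARTITIONS = {
    "p1": range(1, 11),   # docs 01-10
    "p2": range(11, 19),  # docs 11-18
    "p3": range(19, 31),  # docs 19-30
    "p4": range(31, 38),  # docs 31-37
}

def make_splits(partitions, matches):
    """One split per partition, holding only the doc ids the input query selects."""
    return {name: [d for d in ids if matches(d)] for name, ids in partitions.items()}

def map_task(split):
    """Stand-in map function: count the docs in one split."""
    return len(split)

def run_job(partitions, matches):
    """Run one map task per split in parallel and collect the results."""
    splits = make_splits(partitions, matches)
    with ThreadPoolExecutor(max_workers=len(splits)) as pool:
        counts = list(pool.map(map_task, splits.values()))
    return dict(zip(splits.keys(), counts))

# Hypothetical input query: even-numbered documents only.
counts = run_job(PARTITIONS, lambda doc_id: doc_id % 2 == 0)
```

Because the query is applied before the splits are handed out, each map task only sees the documents it needs from its own partition.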
Writing Data to MarkLogic
Write in parallel directly to partitions
§    Auto-discovery of partition topology at job start
§    Client-side hashing to distribute writes
§    Writes directly to partitions
§    Batch update transactions for efficiency
[Diagram: Task 1, Task 2, and Task 3 writing in parallel to partitions on Host 1 and Host 2]
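
A rough Python sketch of client-side hashing with batched writes. The slide does not say which hash the connector uses, so the MD5-based `assign_partition` is an assumption, as are the names `plan_writes`, `partitions`, and the URI pattern.

```python
import hashlib
from collections import defaultdict

def assign_partition(uri, partitions):
    """Pick a partition from a stable hash of the document URI."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return partitions[int.from_bytes(digest[:8], "big") % len(partitions)]

def plan_writes(uris, partitions, batch_size):
    """Group URIs by target partition, then cut each group into batches.

    Each batch stands for one update transaction sent straight to its partition.
    """
    by_partition = defaultdict(list)
    for uri in uris:
        by_partition[assign_partition(uri, partitions)].append(uri)
    return {
        p: [group[i:i + batch_size] for i in range(0, len(group), batch_size)]
        for p, group in by_partition.items()
    }

partitions = ["p1", "p2", "p3"]  # assumed to be discovered once, at job start
uris = [f"/docs/{n}.xml" for n in range(100)]
plan = plan_writes(uris, partitions, batch_size=10)
```

Hashing on the client means every task can compute a document's destination on its own, with no coordinator in the write path.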
Hortonworks Partnership

§  Simplified architecture: Certified MarkLogic
    distribution of Hadoop using Hortonworks Data
    Platform (HDP)
§  Operational: One-stop production support
§  Enterprise-Ready: Best practices and
    reference architecture
MarkLogic Hadoop Roadmap


Today
§  MarkLogic Connector for Hadoop
§  Certification against 0.20.2

Next
§  Unified distribution and support using Hortonworks Data Platform
§  Reference architectures and best practices

Future
§  Tools and ecosystem
§  HDFS as storage
§  Compute platform
Unified Data            Enterprise Operations

Continual Adaptation    Iterative Query
Alerting for Real-Time Models
Alerting allows for real-time match-making
§  Generate statistical model of user behavior in
    Hadoop
§  Mark up documents (or sub-documents) with
    match criteria
§  Combine full-text, geo, and scalar queries for
    real-time decision-making in MarkLogic
§  Scale to billions of documents, trillions of
    matches
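
Alerting can be pictured as search turned around: the queries are stored, and each incoming document is tested against them. The Python sketch below assumes a handful of stored rules, such as a behavior model might emit. `RULES`, `matching_rules`, and every field name are hypothetical and are not MarkLogic's alerting API.

```python
# Stored rules: each names the criteria a document must meet.
# (south, west, north, east) bounding boxes; all values are made up.
RULES = {
    "nyc-sports-fans": {"terms": {"tickets"}, "max_price": 100, "box": (40.0, -75.0, 41.0, -73.0)},
    "london-theatre": {"terms": {"theatre"}, "max_price": 80, "box": (51.0, -1.0, 52.0, 1.0)},
}

def matching_rules(doc, rules):
    """Return the names of all stored rules that an incoming document satisfies."""
    words = set(doc["text"].lower().split())
    matched = []
    for name, rule in rules.items():
        south, west, north, east = rule["box"]
        if (rule["terms"] <= words                       # full-text criterion
                and doc["price"] <= rule["max_price"]    # scalar criterion
                and south <= doc["lat"] <= north         # geospatial criterion
                and west <= doc["lon"] <= east):
            matched.append(name)
    return matched

incoming = {"text": "Discount tickets tonight", "price": 45, "lat": 40.75, "lon": -73.99}
alerts = matching_rules(incoming, RULES)
```

This loop checks every rule for every document; scaling to the billions of documents on the slide would require indexing the stored rules, which the sketch leaves out.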

Examples
What about HBase?

MarkLogic                        HBase
§  Documents                     §  Sparse maps
§  Load as-is, ad hoc queries    §  Model for expected access
§  Integrated full-text search   §  Typically Lucene/Solr bolt-on
§  Built-in scalar, structure,   §  Secondary indexes exclusively
    geo-spatial indexes              in middleware
§  Multi-document ACID           §  Row-level atomicity, strong
    transactions                     consistency
§  MapReduce source and sink     §  MapReduce source and sink
§  Scale to 100s of nodes on     §  Scale to 100s of nodes on
    commodity hardware               commodity hardware
In practice…




[Diagram: remaining components]
  Metadata
  Batch Analytics (Hadoop MR) | Archive (HDFS)
Why Hortonworks?
§  Leaders within Hadoop
    Community
[Chart: Contributions to Hadoop Core, 2011]
§  Delivered every major Hadoop
    release since 0.1
§  Experience managing world’s
    largest deployment
§  Ongoing access to Y!’s 1,000+
    users and 40k+ nodes for
    testing, QA, etc.
§  Unify and Enable Hadoop
    Ecosystem
§  100% open-source
§  Training and support
§  Solutions and reference
    architectures
Intermediate Intelligence Examples

§  ETL for data cleansing, de-duplication, joining
    with reference data
§  Aggregate analysis on user behavior to affect
    applications
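
A toy Python version of these two examples, run in memory rather than as MapReduce jobs. The event shape, the `REFERENCE` table, and the function names are hypothetical and stand in for whatever the real pipeline would use.

```python
from collections import Counter

# Hypothetical reference data: user id -> country.
REFERENCE = {"u1": "US", "u2": "UK"}

def clean_and_join(events, reference):
    """De-duplicate events by id and attach the country from reference data."""
    seen = set()
    out = []
    for e in events:
        if e["id"] in seen:
            continue  # drop repeated event ids
        seen.add(e["id"])
        out.append({**e, "country": reference.get(e["user"], "unknown")})
    return out

def events_by_country(events):
    """Aggregate step: count cleaned events per country."""
    return Counter(e["country"] for e in events)

events = [
    {"id": 1, "user": "u1"},
    {"id": 1, "user": "u1"},  # duplicate of the first event
    {"id": 2, "user": "u2"},
    {"id": 3, "user": "u3"},  # user missing from the reference data
    {"id": 4, "user": "u1"},
]
summary = events_by_country(clean_and_join(events, REFERENCE))
```

The aggregate in `summary` is the kind of small, precomputed result an operational application can read back at request time.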
Progressive Enhancements Examples

§  Metadata extraction
§  Entity enrichment
§  Binary processing: facial recognition, audio-to-text
§  Summarization: sentiment analysis, classification
§  Data cleansing, restructuring, translation
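
Entity enrichment, for instance, might look like the Python sketch below, which tags known names using a small lookup table. The `GAZETTEER` contents, the `enrich` function, and the `entities` field are hypothetical; production enrichment would typically use a dedicated extraction tool.

```python
import re

# Hypothetical lookup table of known entities and their types.
GAZETTEER = {"MarkLogic": "ORG", "Hadoop": "PRODUCT", "London": "PLACE"}

def enrich(doc, gazetteer):
    """Return a copy of doc with an 'entities' field; the original is left unchanged."""
    found = []
    for name, kind in gazetteer.items():
        # Whole-word match so that short names do not match inside longer words.
        if re.search(rf"\b{re.escape(name)}\b", doc["text"]):
            found.append({"name": name, "type": kind})
    return {**doc, "entities": found}

doc = {"uri": "/news/1", "text": "MarkLogic opens London office"}
enriched = enrich(doc, GAZETTEER)
```

Because the enrichment is added alongside the original text, a later pass with a better extractor can replace the `entities` field without reloading the source data.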
Bulk Loading
Parallelize ingestion in MarkLogic for performance
§  Stage in HDFS, load in parallel into MarkLogic
§  Optionally process using MapReduce
[Chart: 9M doc ingestion elapsed time (s), y-axis 0–2500, by cluster size (1–4). Series: MarkLogic single client; MarkLogic + Hadoop.]