Big Data Real Time Applications

Real-Time Big Data Applications
A Reference Architecture for Search, Discovery, and Analytics

Justin Makeig
Director, Product Management, MarkLogic
June 13, 2012

Hello, my name is _________

§  Director, Product Management
§  Focus on APIs, integrations, and tools
§  With MarkLogic since 2007
§  Former web dev, quant

Agenda

§  Characterizing Big Data applications
§  Examples today
§  Combining analytical and operational
§  What’s next?

Who is MarkLogic?

§  300 customers, $85 million+ in revenue
§  300 employees in San Francisco, New York,
London, Tokyo, Austin, Frankfurt, Stockholm
§  Founded in 2003
§  Funded by Sequoia and Tenaya
§  Focus on Media, Government, Financial Services

Big Data Workloads

Analytic
§  Batch
§  Aggregate
§  Repeatable

Operational
§  Real-time, interactive
§  Highly selective
§  Available
§  Secure

Operational Databases

RDBMS
§  Indexes
§  Transactions
§  Security
§  Enterprise operations

“NoSQL”
§  Flexible data model
§  Commodity scale out
§  Distributed, fault-tolerant
§  Hadoop sink/source

What if you could get all of these in one system?

MarkLogic Server

§  Enterprise NoSQL database
§  Flexible data model
§  Scales on commodity hardware (1–1,000 nodes)
§  Rich built-in indexes, including full-text, scalar, geo
§  ACID transactions
§  Enterprise-grade operations

LexisNexis

§  $4.2 billion in revenue,
$2.6 billion LOB
§  5 billion+ documents,
millions of updates/day
§  Real-time search,
discovery, analytics
§  From 9–12 months to
2 weeks for new products
§  Enterprise HA/DR

Top 5 Global Investment Bank

§  Real-time transparency
across all derivatives
§  Predictable scalability
§  Simplified architecture,
operations
§  Mission-critical uptime and
performance
http://www.flickr.com/photos/tenaciousme/1797368175/

US Government Intel Agency

§  Crawl of substantial
part of the Web
§  Evolving enrichment
§  Real-time analysis
§  Granular security
§  Centralized governance
§  ½ DBA
http://www.flickr.com/photos/usarak/4969182481

Unified Data

§  Flexible data model reduces need for ETL
§  Multiple simultaneous applications
§  Single governance model

Enterprise Operations

§  Predictable scalability
§  Replication and failover
§  Backup and recovery
§  Instrumentation and monitoring

Continuous Adaptation

§  Load data as-is, evolve with requirements
§  Add new sources in days, not months
§  Transactional updates for accuracy

Iterative Query

§  Real-time access
§  Multi-faceted queries
–  Full text
–  Structure, semantics, and relationships
–  Scalar values and ranges
(date/time, numbers, strings)
–  Geospatial
§  Alerting
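
A multi-faceted query like the one outlined above can be sketched in plain Python as a logical AND of a full-text, a scalar range, and a geospatial constraint. This is only an illustration: `Doc`, `matches`, and the sample data are hypothetical, it is not MarkLogic API, and it scans a list where a database would answer each constraint from its indexes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    text: str
    published: date
    lat: float
    lon: float


def matches(doc, term, start, end, box):
    """Return True only if the document satisfies all three constraints."""
    south, west, north, east = box
    has_term = term.lower() in doc.text.lower().split()      # full text (word match)
    in_range = start <= doc.published <= end                  # scalar range on a date
    in_box = south <= doc.lat <= north and west <= doc.lon <= east  # geospatial box
    return has_term and in_range and in_box


# Hypothetical sample documents
docs = [
    Doc("Derivatives exposure report", date(2012, 5, 1), 40.7, -74.0),
    Doc("Derivatives overview", date(2010, 1, 1), 40.7, -74.0),
    Doc("Weather report", date(2012, 5, 2), 51.5, -0.1),
]

hits = [
    d for d in docs
    if matches(d, "derivatives", date(2012, 1, 1), date(2012, 12, 31),
               (40.0, -75.0, 41.0, -73.0))
]
```

Only the first document passes all three facets; the second fails the date range and the third fails the text and location constraints.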

Big Data Application Platform

[Diagram: APIs and tools; Visualization; Data Mining; Processing; Metadata; Search; Event; Operational Environment; Analytic DB and EDW; Operational DB; Unstructured Content; Acquisition, Batch Analytics, and Enrichment; Hadoop; Archive]

In practice…
[Diagram: BI Tools; Applications; Stream and Event Processing; Search Index; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Operational DB; Unstructured Content Store; Batch Analytics (Hadoop MR); Archive (HDFS)]

Simplified Architecture

[Diagram: Stream and Event Processing; Search Index; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Operational DB; Unstructured Content Store; Batch Analytics (Hadoop MR); Archive (HDFS)]

Simplified Architecture

[Diagram: Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Batch Analytics (Hadoop MR); Archive (HDFS)]

Combining Analytic and Operational

Use Cases

[Diagram: Raw Data; Operational Applications; MarkLogic + Connector for Hadoop; Hadoop; use cases: 1 Intermediate Intelligence, 2 Progressive Enhancement, 3 Archive]

Intermediate Intelligence
Sophisticated pre-processing for real-time analytics
§  Aggregate, transform, enrich, join, restructure
§  Keep everything: Long-tail, cost-effective warm
storage in HDFS
§  Leverage MapReduce ecosystem for analysis,
ETL, and refinement

Progressive Enhancement
Enhance data incrementally to answer new questions
§  Enrich data for search, analytics, and delivery
§  Leverage MarkLogic indexes for performance,
accuracy
§  Leverage the growing Hadoop/Java ecosystem
for processing
§  Centralized governance, security in MarkLogic

Archive
Age out data to another storage tier
§  Align storage and processing resources with the
value of data
§  Maintain a complete picture of all data
§  Simplified lifecycle management for compliance
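
One simple way to picture aging data out to another tier is an age-based policy, sketched below. The function name, the 90-day threshold, and the sample documents are assumptions for illustration; the slides do not specify how the archive decision is made.

```python
from datetime import date, timedelta


def split_by_age(docs, today, max_age_days):
    """Partition documents into those kept in the operational tier and
    those to move to the archive tier, based on last-modified date."""
    cutoff = today - timedelta(days=max_age_days)
    keep = {uri: d for uri, d in docs.items() if d >= cutoff}
    archive = {uri: d for uri, d in docs.items() if d < cutoff}
    return keep, archive


# Hypothetical documents: URI -> last-modified date
docs = {
    "/a.xml": date(2012, 6, 1),
    "/b.xml": date(2011, 1, 15),
    "/c.xml": date(2012, 4, 20),
}

keep, archive = split_by_age(docs, today=date(2012, 6, 13), max_age_days=90)
```

Every document lands in exactly one tier, so the two tiers together still give a complete picture of the data.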

Reading Data from MarkLogic
Query for input, read in parallel directly from partitions
§  Specify input with a query or expression
§  Automatically divide up input for parallel Map
§  Each split covers one partition

[Diagram: document ranges 01–10, 11–18, 19–30, and 31–37 held in partitions across Host 1 and Host 2]
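
The read path above can be sketched as follows: select input with a query, create one split per partition, and run a map task per split in parallel. This is plain Python, not the MarkLogic Connector for Hadoop API; the partition names, the assignment of document ranges to hosts, and the even-ID "query" are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical topology: partition name -> document IDs it holds
partitions = {
    "host1-p1": list(range(1, 11)),    # docs 01-10
    "host1-p2": list(range(11, 19)),   # docs 11-18
    "host2-p1": list(range(19, 31)),   # docs 19-30
    "host2-p2": list(range(31, 38)),   # docs 31-37
}


def make_splits(partitions, predicate):
    """Build one split per partition, holding only the documents the query selects."""
    return [(name, [doc_id for doc_id in ids if predicate(doc_id)])
            for name, ids in partitions.items()]


def map_task(split):
    """Stand-in for a Map task: count the documents in its split."""
    name, ids = split
    return name, len(ids)


# Stand-in for an input query: select documents with even IDs
splits = make_splits(partitions, lambda doc_id: doc_id % 2 == 0)

# Each split is processed independently, so the tasks can run in parallel
with ThreadPoolExecutor(max_workers=len(splits)) as pool:
    counts = dict(pool.map(map_task, splits))
```

Because a split never spans partitions, each task reads from a single partition and needs no coordination with the others.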

Writing Data to MarkLogic
Write in parallel directly to partitions
§  Auto-discovery of partition topology at job start
§  Client-side hashing to distribute writes
§  Writes directly to partitions
§  Batch update transactions for efficiency
[Diagram: Task 1, Task 2, and Task 3 writing to partitions on Host 1 and Host 2]
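
A minimal sketch of the write path: hash each document URI on the client to choose a partition, then group the writes for each partition into batches. The slides do not say which hash function or batch size is used, so MD5, the partition names, and the batch size of 3 are assumptions.

```python
import hashlib
from collections import defaultdict


def assign_partition(uri, partitions):
    """Pick a partition deterministically from a hash of the URI.
    MD5 is only a stand-in for whatever hash the real system uses."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return partitions[int.from_bytes(digest[:8], "big") % len(partitions)]


def batch_writes(uris, partitions, batch_size):
    """Group URIs by target partition, then chunk each group into batches."""
    by_partition = defaultdict(list)
    for uri in uris:
        by_partition[assign_partition(uri, partitions)].append(uri)
    batches = []
    for partition, items in by_partition.items():
        for i in range(0, len(items), batch_size):
            batches.append((partition, items[i:i + batch_size]))
    return batches


# Hypothetical topology, as if discovered at job start
partitions = ["host1-p1", "host1-p2", "host2-p1"]
uris = [f"/docs/{i}.xml" for i in range(10)]

batches = batch_writes(uris, partitions, batch_size=3)
```

Each batch targets exactly one partition, so it could be sent straight to the host that owns that partition as a single update transaction.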

Hortonworks Partnership

§  Simplified architecture: Certified MarkLogic
distribution of Hadoop using Hortonworks Data
Platform (HDP)
§  Operational: One-stop production support
§  Enterprise-Ready: Best practices and
reference architecture

MarkLogic Hadoop Roadmap

Today
§  MarkLogic Connector for Hadoop
§  Certification against 0.20.2

Next
§  Unified distribution and support using Hortonworks Data Platform
§  Reference architectures and best practices

Future
§  Tools and ecosystem
§  HDFS as storage
§  Compute platform

[Diagram: Unified Data; Enterprise Operations; Continual Adaptation; Iterative Query]

Alerting for Real-Time Models
Alerting allows for real-time match-making
§  Generate statistical model of user behavior in
Hadoop
§  Mark up documents (or sub-documents) with
match criteria
§  Combine full-text, geo, and scalar queries for
real-time decision-making in MarkLogic
§  Scale to billions of documents, trillions of
matches
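
The match-making idea above can be sketched as a reverse lookup: criteria are stored ahead of time, and each incoming document is tested against them. The `Rule` type, its fields, and the sample data are hypothetical, and the linear scan is only for illustration; handling billions of documents would rely on indexed matching rather than a loop.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    term: str          # full-text criterion (single word)
    min_price: float   # scalar range criterion
    max_price: float


def matching_rules(doc, rules):
    """Return the names of all stored rules that the incoming document satisfies."""
    words = set(doc["text"].lower().split())
    return [r.name for r in rules
            if r.term in words and r.min_price <= doc["price"] <= r.max_price]


# Hypothetical stored criteria, e.g. derived from a model built in Hadoop
rules = [
    Rule("bargain-bikes", "bike", 0, 200),
    Rule("any-bike", "bike", 0, 10_000),
    Rule("cheap-cars", "car", 0, 5_000),
]

# A newly arrived document is matched against the stored criteria
doc = {"text": "Road bike lightly used", "price": 450.0}
alerts = matching_rules(doc, rules)
```

Here the document triggers only the "any-bike" rule: it mentions a bike, but its price is above the "bargain-bikes" ceiling.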

Examples

What about HBase?

MarkLogic
§  Documents
§  Load as-is, ad hoc queries
§  Integrated full-text search
§  Built-in scalar, structure, geo-spatial indexes
§  Multi-document ACID transactions
§  MapReduce source and sink
§  Scale to 100s of nodes on commodity hardware

HBase
§  Sparse maps
§  Model for expected access
§  Typically Lucene/Solr bolt-on
§  Secondary indexes exclusively in middleware
§  Row-level atomicity, strong consistency
§  MapReduce source and sink
§  Scale to 100s of nodes on commodity hardware

In practice…

[Diagram: Metadata; Batch Analytics (Hadoop MR); Archive (HDFS)]

Why Hortonworks?
§  Leaders within Hadoop Community
§  Delivered every major Hadoop
release since 0.1
§  Experience managing world’s
largest deployment
§  Ongoing access to Y!’s 1,000+
users and 40k+ nodes for
testing, QA, etc.
§  Unify and Enable Hadoop
Ecosystem
§  100% open-source
§  Training and support
§  Solutions and reference
architectures
[Chart: Contributions to Hadoop Core, 2011]

Intermediate Intelligence Examples

§  ETL for data cleansing, de-duplication, joining
with reference data
§  Aggregate analysis on user behavior to affect
applications

Progressive Enhancements Examples

§  Metadata extraction
§  Entity enrichment
§  Binary processing: facial recognition, audio-to-text
§  Summarization: sentiment analysis, classification
§  Data cleansing, restructuring, translation

Bulk Loading
Parallelize ingestion in MarkLogic for performance
§  Stage in HDFS, load in parallel into MarkLogic
§  Optionally process using MapReduce
[Chart: 9M doc ingestion elapsed time (s), 0–2,500, versus cluster size (1–4); series: MarkLogic single client, MarkLogic + Hadoop]
