
Big Data Real Time Applications



Published in: Technology, Business

  1. 1. Real-Time Big Data Applications: A Reference Architecture for Search, Discovery, and Analytics
      Justin Makeig, Director, Product Management, MarkLogic
      June 13, 2012
  2. 2. Hello, my name is _________
      • Director, Product Management
      • Focus on APIs, integrations, and tools
      • With MarkLogic since 2007
      • Former web dev, quant
  3. 3. Agenda
      • Characterizing Big Data applications
      • Examples today
      • Combining analytical and operational
      • What’s next?
  4. 4. Who is MarkLogic?
      • 300 customers, $85 million+ in revenue
      • 300 employees in San Francisco, New York, London, Tokyo, Austin, Frankfurt, Stockholm
      • Founded in 2003
      • Funded by Sequoia and Tenaya
      • Focus on Media, Government, Financial Services
  5. 5. Big Data Workloads
      • Analytic
        – Batch
        – Aggregate
        – Repeatable
      • Operational
        – Real-time, interactive
        – Highly selective
        – Available
        – Secure
  6. 6. Operational Databases
      • RDBMS
        – Indexes
        – Transactions
        – Security
        – Enterprise operations
      • “NoSQL”
        – Flexible data model
        – Commodity scale out
        – Distributed, fault-tolerant
        – Hadoop sink/source
      What if you could get all of these in one system?
  7. 7. MarkLogic Server
      • Enterprise NoSQL database
      • Flexible data model
      • Scales on commodity hardware (1–1,000 nodes)
      • Rich built-in indexes, including full-text, scalar, geo
      • ACID transactions
      • Enterprise-grade operations
  8. 8. Operational Big Data
  9. 9. LexisNexis
      • $4.2 billion in revenue, $2.6 billion LOB
      • 5 billion+ documents, millions of updates/day
      • Real-time search, discovery, analytics
      • From 9–12 months to 2 weeks for new products
      • Enterprise HA/DR
  10. 10. Top 5 Global Investment Bank
      • Real-time transparency across all derivatives
      • Predictable scalability
      • Simplified architecture, operations
      • Mission-critical uptime and performance
  11. 11. US Government Intel Agency
      • Crawl of substantial part of the Web
      • Evolving enrichment
      • Real-time analysis
      • Granular security
      • Centralized governance
      • ½ DBA
  12. 12. Big Data Applications
  13. 13. Unified Data
      • Flexible data model reduces need for ETL
      • Multiple simultaneous applications
      • Single governance model
  14. 14. Enterprise Operations
      • Predictable scalability
      • Replication and failover
      • Backup and recovery
      • Instrumentation and monitoring
  15. 15. Continuous Adaptation
      • Load data as-is, evolve with requirements
      • Add new sources in days, not months
      • Transactional updates for accuracy
  16. 16. Iterative Query
      • Real-time access
      • Multi-faceted queries
        – Full text
        – Structure, semantics, and relationships
        – Scalar values and ranges (date/time, numbers, strings)
        – Geospatial
      • Alerting
  17. 17. Big Data Application Platform
      [Diagram: APIs and tools; Visualization; Data Mining; Search; Event Processing; Metadata; Operational Environment; Analytic DB and EDW; Operational DB; Unstructured Content; Acquisition, Batch Analytics, and Enrichment; Hadoop; Archive]
  18. 18. In practice…
      [Diagram: BI Tools; Applications; Stream and Event Processing; Search Index; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Operational DB; Unstructured Content Store; Batch Analytics (Hadoop MR); Archive (HDFS)]
  19. 19. Simplified Architecture
      [Diagram: BI Tools; Applications; Stream and Event Processing; Search Index; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Operational DB; Unstructured Content Store; Batch Analytics (Hadoop MR); Archive (HDFS)]
  20. 20. Simplified Architecture
      [Diagram: BI Tools; Applications; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Batch Analytics (Hadoop MR); Archive (HDFS)]
  21. 21. Simplified Architecture
      [Diagram: BI Tools; Applications; Stats (SPSS, SAS, R, …); Metadata; Analytic DB / EDW; Batch Analytics (Hadoop MR); Archive (HDFS)]
  22. 22. Combining Analytic and Operational
  23. 23. Use Cases
      [Diagram: Raw Data; Operational Applications; MarkLogic + Connector for Hadoop; Hadoop]
      1 Intermediate Intelligence
      2 Progressive Enhancement
      3 Archive
  24. 24. Intermediate Intelligence
      Sophisticated pre-processing for real-time analytics
      • Aggregate, transform, enrich, join, restructure
      • Keep everything: Long-tail, cost-effective warm storage in HDFS
      • Leverage MapReduce ecosystem for analysis, ETL, and refinement
  25. 25. Progressive Enhancement
      Enhance data incrementally to answer new questions
      • Enrich data for search, analytics, and delivery
      • Leverage MarkLogic indexes for performance, accuracy
      • Leverage the growing Hadoop/Java ecosystem for processing
      • Centralized governance, security in MarkLogic
  26. 26. Archive
      Age out data to another storage tier
      • Align storage and processing resources with the value of data
      • Maintain a complete picture of all data
      • Simplified lifecycle management for compliance
  27. 27. Reading Data from MarkLogic
      Query for input, read in parallel directly from partitions
      • Specify input with a query or expression
      • Automatically divide up input for parallel Map
      • Each split covers one partition
      [Diagram: document ranges 01–10, 11–18, 19–30, 31–37 spread across Host 1 and Host 2]
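
The split-per-partition idea on slide 27 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the MarkLogic Connector for Hadoop API: `Partition`, `Split`, and `plan_splits` are hypothetical names, and the assignment of document ranges to hosts is a guess, since the slide's diagram does not survive in the transcript.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Partition:
    """Hypothetical stand-in for one database partition on a host."""
    host: str
    doc_ids: range


@dataclass(frozen=True)
class Split:
    """One unit of Map input: the matching documents of a single partition."""
    host: str
    doc_ids: List[int]


def plan_splits(partitions: List[Partition],
                matches: Callable[[int], bool]) -> List[Split]:
    """Apply the input query per partition and emit one split per partition."""
    splits = []
    for part in partitions:
        selected = [doc for doc in part.doc_ids if matches(doc)]
        if selected:  # a partition with no matching documents yields no split
            splits.append(Split(part.host, selected))
    return splits


# Document ranges follow the slide; the host assignment is assumed.
partitions = [
    Partition("host1", range(1, 11)),
    Partition("host1", range(11, 19)),
    Partition("host2", range(19, 31)),
    Partition("host2", range(31, 38)),
]

# The "query" here is an arbitrary predicate: even-numbered documents only.
splits = plan_splits(partitions, lambda doc: doc % 2 == 0)
for split in splits:
    print(split.host, split.doc_ids)
```

Because each split carries its host, a scheduler could place the corresponding Map task close to the data it reads; that placement step is left out of the sketch.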
  28. 28. Writing Data to MarkLogic
      Write in parallel directly to partitions
      • Auto-discovery of partition topology at job start
      • Client-side hashing to distribute writes
      • Writes directly to partitions
      • Batch update transactions for efficiency
      [Diagram: Task 1, Task 2, Task 3 writing to Host 1 and Host 2]
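
Client-side hashing with batched writes, as listed on slide 28, might look roughly like the following. This is a minimal sketch, assuming documents are identified by URI strings and that the partition count is already known from topology discovery; `partition_for` and `batched_writes` are hypothetical helpers, and the MD5-based hash is an illustrative choice rather than the one the connector uses.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def partition_for(uri: str, num_partitions: int) -> int:
    """Map a document URI to a partition index with a stable hash."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def batched_writes(uris: Iterable[str], num_partitions: int,
                   batch_size: int) -> Iterator[Tuple[int, List[str]]]:
    """Group URIs by target partition and yield (partition, batch) pairs.

    Each yielded batch stands in for one update transaction sent
    directly to the host that owns the partition.
    """
    pending: Dict[int, List[str]] = defaultdict(list)
    for uri in uris:
        part = partition_for(uri, num_partitions)
        pending[part].append(uri)
        if len(pending[part]) == batch_size:
            yield part, pending.pop(part)
    # Flush partially filled batches at the end of the task.
    for part, batch in sorted(pending.items()):
        yield part, batch


uris = [f"/docs/{i}.xml" for i in range(10)]
batches = list(batched_writes(uris, num_partitions=4, batch_size=3))
for part, batch in batches:
    print(part, batch)
```

Hashing on the client means every task can compute a document's destination without a coordinator, and batching amortizes transaction overhead across several documents.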
  29. 29. Hortonworks Partnership
      • Simplified architecture: Certified MarkLogic distribution of Hadoop using Hortonworks Data Platform (HDP)
      • Operational: One-stop production support
      • Enterprise-Ready: Best practices and reference architecture
  30. 30. MarkLogic Hadoop Roadmap
      • Today
        – MarkLogic Connector for Hadoop
        – Certification against 0.20.2
      • Next
        – Unified distribution and support using Hortonworks Data Platform
        – Reference architectures and best practices
      • Future
        – Tools and ecosystem
        – HDFS as storage
        – Compute platform
  31. 31. Unified Data, Enterprise Operations, Continual Adaptation, Iterative Query
  32. 32. Alerting for Real-Time Models
      Alerting allows for real-time match-making
      • Generate statistical model of user behavior in Hadoop
      • Mark-up documents (or sub-documents) with match criteria
      • Combine full-text, geo, and scalar queries for real-time decision-making in MarkLogic
      • Scale to billions of documents, trillions of matches
      Examples
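
The match-making on slide 32 inverts the usual search flow: match criteria are stored ahead of time, and each arriving document is tested against them. The sketch below illustrates that idea with a rule that combines a full-text term, a scalar bound, and a geographic bounding box. `Rule`, `Doc`, `matches`, and `alert` are hypothetical names, not MarkLogic APIs, and the linear scan is a simplification; a production system would index the stored criteria instead of checking every rule.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    """A stored match criterion: term AND price bound AND bounding box."""
    name: str
    term: str
    max_price: float
    box: Tuple[float, float, float, float]  # south, west, north, east


@dataclass(frozen=True)
class Doc:
    text: str
    price: float
    lat: float
    lon: float


def matches(rule: Rule, doc: Doc) -> bool:
    """True when the document satisfies all three parts of the rule."""
    south, west, north, east = rule.box
    return (
        rule.term.lower() in doc.text.lower().split()  # full-text term
        and doc.price <= rule.max_price                # scalar range
        and south <= doc.lat <= north                  # geo: latitude
        and west <= doc.lon <= east                    # geo: longitude
    )


def alert(rules: List[Rule], doc: Doc) -> List[str]:
    """Return the names of all stored rules that the new document triggers."""
    return [rule.name for rule in rules if matches(rule, doc)]


rules = [
    Rule("sf-bikes", "bike", 500.0, (37.0, -123.0, 38.0, -122.0)),
    Rule("ny-bikes", "bike", 500.0, (40.0, -75.0, 41.0, -73.0)),
]
doc = Doc("Road bike for sale", 350.0, 37.77, -122.42)
print(alert(rules, doc))  # prints ['sf-bikes']
```

In the workflow the slide describes, the rules themselves would be derived from a statistical model built in Hadoop; here they are written by hand to keep the example short.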
  33. 33. What about HBase?
      • MarkLogic
        – Documents
        – Load as-is, ad hoc queries
        – Integrated full-text search
        – Built-in scalar, structure, geo-spatial indexes
        – Multi-document ACID transactions
        – MapReduce source and sink
        – Scale to 100s of nodes on commodity hardware
      • HBase
        – Sparse maps
        – Model for expected access
        – Typically Lucene/Solr bolt-on
        – Secondary indexes exclusively in middleware
        – Row-level atomicity, strong consistency
        – MapReduce source and sink
        – Scale to 100s of nodes on commodity hardware
  34. 34. In practice…
      [Diagram: Metadata; Batch Analytics (Hadoop MR); Archive (HDFS)]
  35. 35. Why Hortonworks?
      • Leaders within Hadoop Community
      • Delivered every major Hadoop release since 0.1
      • Experience managing world’s largest deployment
      • Ongoing access to Y!’s 1,000+ users and 40k+ nodes for testing, QA, etc.
      • Unify and Enable Hadoop Ecosystem
      • 100% open-source
      • Training and support
      • Solutions and reference architectures
      [Chart: Contributions to Hadoop Core, 2011]
  36. 36. Intermediate Intelligence Examples
      • ETL for data cleansing, de-duplication, joining with reference data
      • Aggregate analysis on user behavior to affect applications
  37. 37. Progressive Enhancement Examples
      • Metadata extraction
      • Entity enrichment
      • Binary processing: facial recognition, audio-to-text
      • Summarization: sentiment analysis, classification
      • Data cleansing, restructuring, translation
  38. 38. Bulk Loading
      Parallelize ingestion in MarkLogic for performance
      • Stage in HDFS, load in parallel into MarkLogic
      • Optionally process using MapReduce
      [Chart: 9M doc ingestion elapsed time (s), axis 0–2,500, versus cluster size (1–4); series: MarkLogic single client, MarkLogic + Hadoop]