• Save
Finding the Needle in a Big Data Haystack
Upcoming SlideShare
Loading in...5
×
 

Finding the Needle in a Big Data Haystack

on

  • 437 views

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1jHRCTd. ...

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1jHRCTd.

In this solutions track talk, sponsored by Cloudera, Eva Andreasson discusses how search and Hadoop can help with some of the industry's biggest challenges. She introduces the data hub concept. Filmed at qconlondon.com.

Eva Andreasson has been working with Java virtual machine technologies, SOA, Cloud, and other enterprise middleware solutions for the past 10 years. She joined Cloudera in 2012. Recently launched Cloudera Search - a new real-time search workload natively integrated with the Hadoop stack - and drove the efforts that lead to 10x user uptake of Hue - the UI of the Hadoop ecosystem.

Statistics

Views

Total Views
437
Views on SlideShare
437
Embed Views
0

Actions

Likes
3
Downloads
0
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Finding the Needle in a Big Data Haystack Finding the Needle in a Big Data Haystack Presentation Transcript

  • SOLUTION TRACK Finding the Needle in a Big Data Haystack @EvaAndreasson, Innovator & Problem Solver Cloudera
  • InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /search-hadoop-data-hub
  • Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • Agenda • Problem (Solving) • Apache Solr + Apache Hadoop et al • Real-world examples • Q&A
  • Problem Solving Does it Work? Will You Get in Trouble
  • Information Driven Problem Solving • Ask a Question • Find All Relevant Data to Serve the Question • Process the Data to Answer the Question
  • Information Driven Businesses
  • Thousands of employees & lots of data Difficult to access Heterogeneous legacy IT Infrastructure Difficult to manage Hard to scale Silos of multi- structured data Difficult to Integrate Holds copies of data Problem: Finding the Data (Needle) Across (Hay) Silos ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources Data Archives EDWs Marts SearchServers Document Stores Storage ©2014 Cloudera, Inc. All Rights Reserved. Reproduction or redistribution without written permission is prohibited.
  • Independent of audience, topic, or tool, data is accessible Unified data storage, processing, management and security Ingest all data any type or any scale Eliminate copies, simplify aggregation and correlation EDWs Marts Storage Search Solution: The Enterprise Data Hub (EDH) Servers Documents ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources EDH Archives ©2014 Cloudera, Inc. All Rights Reserved. Reproduction or redistribution without written permission is prohibited. EDH
  • Apache Hadoop et al at the Core • Open source • white box, best innovation at all times • Flexible • Scalable • ingest, storage, processing • Cost efficient
  • But an EDH also Needs… • Security & audit • Manageability, Visibility and Resource Control • Open architecture • Multi-workload support and optimization
  • Problem solved! We can go home…?
  • Problem: Finding the Needle in a Big Data Haystack?
  • New Audiences, New Challenges • Non-technical staff needs access to data • Same data used in bigger processes • Speed up manual introspection • Technical staff needs to view / explore data • View interim results • Drill down into mission critical data • Explore data to design models • Cross-workload needs • Combine structured and unstructured data
  • Solution: Everyone Knows Search! Explore Navigate Correlate
  • 15 Cloudera Search: How We Integrated Solr with Hadoop et al
  • Cloudera Search Interactive search for Hadoop • Full-text and faceted navigation • Batch, near real-time, and on-demand indexing 16 Apache Solr integrated with CDH • Established, mature search with vibrant community • Separate runtime like MapReduce, Impala • Incorporated as part of the Hadoop ecosystem Open Source • 100% Apache, 100% Solr • Standard Solr APIs
  • Scalable and Robust Index Storage HDFS Lucene Extraction Mapping Solr Zookeeper SolrCloud Querying API Indexing API 17 Solr and HDFS • Scalable, cost-efficient index storage • Higher availability • Search and process data in one platform
  • Near Real Time Indexing at Ingest Log File Solr and Flume • Data ingest at scale • Flexible extraction and mapping • Indexing at data ingest • Document-level ACL HDFS Flume Agent Indexer Other Log File Flume Agent Indexer 18
  • Scalable Batch Indexing Files Files SolrCloud Cluster 19 HDFS Solr and MapReduce • Flexible, scalable batch indexing • GOLIVE: Start serving new indices with no downtime • On-demand indexing, cost- efficient re-indexing Index shard Index shard MR Indexer MR Indexer
  • Scalable Batch Indexing 20 Mapper: Parse input into indexable document Mapper: Parse input into indexable document Mapper: Parse input into indexable document Index shard 1 Index shard 2 Arbitrary reducing steps of indexing and merging End-Reducer (shard 1): Index document End-Reducer (shard 2): Index document
  • HBase Secondary Indexes, in Real-Time interactiveload Replication Listener(s) (Lily) Triggers on updates Solr server Solr server Solr server Solr server Solr server HDFS HBase Solr and HBase • Secondary Indexes made easy and flexible
  • HBase Secondary Indexes, or in Batch interactiveload Replication Listener(s) (Lily) Triggers on updates Solr server Solr server Solr server Solr server Solr server MR Indexer HDFS MR Indexer HBase Index shard MR Indexer Solr and HBase • Real Time or Batch
  • Streamlined Extraction and Mapping Morphlines • Simple and flexible data transformation • Reusable across multiple index (and other) workloads syslog Flume Agent Solr sink Command: readLine Command: grok Command: loadSolr Solr Event Record Record Record Document
  • Security Cloudera Search + Sentry • Cluster level access control • Index level access control
  • Simple, Customizable Search UI Hue • Simple UI • Navigated, faceted drill down • Customizable display • Full text search, standard Solr API and query language
  • Simplified Management Cloudera Manager • Install, configure, deploy SolrCloud on the cluster • Centralized management and monitoring – cross workloads • Unified resource management and control
  • Data End User Client App (e.g. Hue) Flume HDFS Raw, filtered, or annotaded data SolrCloud Cluster(s) Data to be indexed Indexed data MapReduce Batch Indexing GoLive updates HBase Cluster Replication Events to be indexed Data ClouderaManager Search queries Architecture Overview Sentry Authorization
  • Some Real-World Examples • Image processing and data correlation • Quick result checks on new algorithms • Correlate image and device logs or external data • Image exploration as a service • Expedite data modeling processes • Free-text form matching • Claims report correlation, to gather data likely to be similar • Serve 360-client or patient records in a speedier way • Fraud / pattern extraction • Log management • Real-time drill down • Long term trending and capacity planning • Anomaly detection over larger sets of data
  • The EDH - Information-Driven Problem Solving Made Easy!
  • Integration is Key • Eliminate data moves or copies, break silos • Create a truly active archive • Serve non-technical audiences on the same platform as where advanced analytics workloads run • Analyze and combine structured and non-structured data • Expedite exploration of various data types • Find data and do something with it – where it is stored • Future proof your data management system
  • Learn More • Cloudera.com • Read our blog • Take our online training (or get Cloudera certified) • Download whitepapers • View webinars • Talk to our customers • Follow/contact me • @EvaAndreasson
  • Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/search- hadoop-data-hub