This document summarizes Rackspace's use of Hadoop to process and query logs from multiple datacenters. Key points:

- Rackspace needed to query logs from mail/app servers to answer support and analytics questions; previous solutions built on single databases could not scale across datacenters.
- Hadoop allowed ingesting raw logs, building Lucene indexes for querying, and storing data across multiple datacenters. Real-time queries used Solr; batch queries used MapReduce.
- The implementation collected logs into Hadoop, used SolrOutputFormat to generate indexes, and queried via distributed Solr and MapReduce, providing scalable storage, analysis, and querying across datacenters.
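To make the batch-query path concrete, here is a minimal sketch of the map and reduce phases of a MapReduce-style log query, run in plain Python rather than on a Hadoop cluster. The log format, field names, and the example question ("delivery failures per mail server") are illustrative assumptions, not Rackspace's actual schema or job code.

```python
from collections import defaultdict

def map_phase(log_lines):
    """Map step: emit (server, 1) for each line recording a failure.

    Assumes a hypothetical whitespace-delimited format:
    <server> <status> <recipient>
    """
    for line in log_lines:
        server, status = line.split()[:2]
        if status == "FAIL":
            yield server, 1

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each server key."""
    counts = defaultdict(int)
    for server, n in pairs:
        counts[server] += n
    return dict(counts)

# Toy input standing in for raw mail-server logs ingested into HDFS.
logs = [
    "mail1 FAIL user@example.com",
    "mail2 OK   user@example.com",
    "mail1 FAIL other@example.com",
]

failures = reduce_phase(map_phase(logs))
print(failures)  # {'mail1': 2}
```

On a real cluster, Hadoop would shard the map phase across the datacenters' log files and group the emitted keys before the reduce phase; the per-key aggregation logic is the same.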