Offline Processing with Hadoop

Chris K Wensel
Concurrent, Inc.
Introduction

Chris K Wensel
chris@wensel.net

• Cascading, Lead Developer
  – http://cascading.org/
• Concurrent, Inc., Founder
  – Hadoop/Cascading support and tools
  – http://concurrentinc.com/
Computing Systems

[diagram: data → info → value]

• Exist to create value out of data
• Everything else is an implementation detail
In Today's Computing Environment

• Lots of relevant medium-to-large data sets
  – that individually could fit in an RDBMS
• Lots of applications touching that data
  – where do you think Perl came from?
• Underutilized hardware owning (intermediate) data
  – Xen/VMware add complexity (sprawl)
continued...

• Raw data continuously arriving (and in bursts)
  – we mostly care about the new stuff
• Raw data is dirty
  – bots and bugs
• Demands on timely/predictable result availability
  – downstream systems must be fed
• The ‘Cloud’ is enabling an on-demand model
Data Warehousing != Data Processing

[diagram: ETL hub-and-spoke [monolithic] vs. process streams [distributed]]

• Data Warehousing
  – monolithic systems and data schema
  – distribution through manual federation/sharding
• Data Processing
  – cluster of peer systems
  – dynamic, even distribution of data and processing
Data Warehousing

[diagram: loggers → raw data → ETL → data warehouse [cache] → ETL → reporting [BI, KPI, etc] and data mining → product → Consumer; Analyst pulls some data into R, SAS, Excel, etc]

• Agility: no “one size fits all” schema, resistant to change
• Complex Analytics: cannot be represented by SQL
• Massive Data Sets: won’t fit or too
Production Data Processing

[diagram: loggers → raw data → data processing → valuable data → Consumer]

• Online / Real-Time processing
  – low latency (milliseconds to seconds for results)
  – smaller datasets - streams
• Offline / Batch
  – high latency (minutes to days for results)
  – larger datasets - files
Hadoop Adoption

[diagram: Cluster → Racks → Nodes, forming a Global Compute-space and a Global Namespace]

• Distributed, replicated storage for large files
• Distributed, fault-tolerant execution of batch processes
• Scale out vs. (legacy) scale up
• Java API allows complex analysis
But Stuffed into Legacy Roles

[diagram: loggers → raw data → ETL → data warehouse (Hadoop + Pig/Hive) → ETL → data mining → Analyst]

• Hadoop deployments mirror legacy architectures
  – ETL into cached “structured storage”
• Pig/Hive are syntaxes for Data Mining “Big” data
  – SQL-like, but hard to customize and not “advanced”
Hadoop for Data Processing

[diagram: Simplicity → Scalability → Value Creation]

• More Value through Innovation
• Scalability, Not Performance
• Simplifies Infrastructure
Simplicity

[diagram: Cluster → Racks → Nodes; CPUs form a Global Compute-space, disks form a Global Namespace]

• Virtualization across resources, not within (PaaS)
  – A single FileSystem across disks - no DBA
  – A single Execution System across CPUs - less IT
Scalability

[diagram: Users (Clients) submit jobs to the Cluster; jobs spread across Racks and Nodes]

• Scalability - continued reliability and met expectations as demand changes
• Application Scalability - as data grows, app/infra expand
• Organizational Scalability - simpler infrastructure
Creating Value

[diagram: Producer → loggers → raw data → data processing (Hadoop + Cascading: ETL, analytics) → events, reporting, product, operational data → Consumer → Value]

• Unconstrained processing model
• Data processing requires integration
• Processing must not fail or fall behind
Consequences

• Improved reliability of production processes
  – “we had a failed disk yet jobs never failed”
• Greater utilization of hardware resources
  – dynamically moves code to available cores
• Increased rate of innovation
  – diverse analytics over larger sets, less bureaucracy
• Fewer staff
Hadoop MapReduce

[diagram: Count Job: File → Map → [ k, v ] → Reduce over [ k, [v] ] → File; Sort Job: File → Map → [ k, v ] → Reduce over [ k, [v] ] → File]

  [ k, v ] = key and value pair
  [ k, [v] ] = key and associated values collection

• Nearly impossible to “think in”
• Apps are many dependent MR jobs
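The [ k, v ] → [ k, [v] ] flow above, and the way apps chain dependent jobs, can be sketched as a minimal in-memory simulation of the model (plain Python, not Hadoop code; the word/line data is illustrative):

```python
from itertools import groupby
from operator import itemgetter

def run_mr(records, mapper, reducer):
    """Minimal in-memory MapReduce: map, shuffle/sort by key, reduce."""
    mapped = [kv for rec in records for kv in mapper(rec)]   # emit [ k, v ] pairs
    mapped.sort(key=itemgetter(0))                           # shuffle/sort phase
    out = []
    for key, group in groupby(mapped, key=itemgetter(0)):    # reduce sees [ k, [v] ]
        out.extend(reducer(key, [v for _, v in group]))
    return out

# Job 1: count words (Count Job)
counts = run_mr(
    ["the quick fox", "the lazy dog"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: [(word, sum(ones))],
)

# Job 2: rank by count (Sort Job) - a dependent job that rekeys Job 1's output
ranked = run_mr(
    counts,
    mapper=lambda kv: [(kv[1], kv[0])],
    reducer=lambda n, words: [(n, w) for w in sorted(words)],
)
```

Even this trivial word-count/sort takes two chained jobs, which is the point of the last bullet: real applications become long graphs of dependent MR jobs.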
Cascading

[diagram: Word Count/Sort Flow: Data → Parse (Map) → Group → Count (Reduce) → Sort (Map/Reduce) → Data, passing [ f1, f2, ... ] tuples between operations]

  [ f1, f2, ... ] = tuples with field names

• Alternative model & API to MapReduce
  – pipes/filters of re-usable operations
• For rapidly implementing Data Processing Systems
• Open-Source
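The Parse → Group → Count → Sort flow above can be sketched as pipe/filter stages over field-named tuples. This is a minimal Python illustration of the model only, not the actual Cascading Java API; the field names and sample lines are made up:

```python
# Pipe/filter sketch: each stage consumes and emits a stream of
# field-named tuples (modeled here as dicts), like [ f1, f2, ... ].
from collections import Counter

def parse(lines):                      # Parse: raw line -> {"word": ...} tuples
    for line in lines:
        for word in line.lower().split():
            yield {"word": word}

def group_and_count(tuples, on):       # Group + Count: one tuple per group
    counts = Counter(t[on] for t in tuples)
    for key, n in counts.items():
        yield {on: key, "count": n}

def sort_by(tuples, field):            # Sort: order tuples by a field
    return sorted(tuples, key=lambda t: t[field])

# Assemble the flow by composing reusable operations:
flow = sort_by(
    group_and_count(parse(["the quick fox", "the lazy dog"]), on="word"),
    field="count",
)
```

The appeal is that the author thinks in named fields and reusable operations, and a planner (in Cascading's case) translates the assembled flow into however many MapReduce jobs are needed.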
Emerging Tool Support

• Karmasphere IDE (soon)
  – Developing and Debugging
• Bixo (Bixo Labs) Data Mining Toolkit
  – Apache Nutch replacement
  – Easier to customize to meet new business models
• Clojure & JRuby Domain Specific Languages (DSLs)
  – Machine Learning
  – Simple/Complex Ad-Hoc queries
Practical Applications
• Log/event analysis, device and system
  monitoring
• Web crawling and content mining
• Behavioral ad-targeting segmentation
• Ad campaign ROI
• Demand and event prediction
• POS analytics for product demand pricing
Successes

• Publicis/RazorFish - Behavioral Ad-Targeting
  – Cascading + AWS (Elastic MapReduce)
  – Daily automated User Behavior Segmentation
  – 6 wks dev, 3 TB/day, $13k/mo
  – 500% increase in return on ad spend from a similar campaign a year before
continued...

• FlightCaster - Predicting flight delays
  – Clojure + Cascading + AWS
  – Machine learning and production processing
  – 3 mos dev, 10 GB/day, <1 TB total currently, <$2k/mo
• Etsy - Online Marketplace
  – JRuby + Cascading
  – Data mining (Hadoop as a DW!)
  – 750M page-views/mo, 60 GB/day of logs
Resources
• Chris K Wensel
  – chris@wensel.net
  – @cwensel
• Cascading
  – an API for optimizing production data
    processing
  – http://cascading.org
• Concurrent, Inc.
  – Support and Mentoring
  – http://concurrentinc.com
