Hadoop as Data Refinery - Steve Loughran

Hadoop as a Data Refinery

Steve Loughran– Hortonworks
@steveloughran
London, October 2012

© Hortonworks Inc. 2012

About me:
• HP Labs:
–Deployment, cloud infrastructure, Hadoop-in-Cloud
• Apache – member and committer
–Ant, Axis ; author: Ant in Action
–Hadoop
–Dynamic deployments
–Diagnostics on failures
–Cloud infrastructure integration
• Joined Hortonworks in 2012
–UK based: R&D

Page 2

What is Apache Hadoop?

• Collection of Open Source Projects One of the best examples of
– Apache Software Foundation (ASF) open source driving innovation
– commercial and community development and creating a market

• Foundation for Big Data Solutions
– Stores petabytes of data reliably
– Runs highly distributed computation
– Commodity servers & storage
– Powers data-driven business

Page 3

Why Hadoop?
Business Pressure
1 Opportunity to enable innovative new business models

2 Potential new insights that drive competitive advantage

Technical Pressure
3 Data collected and stored continues to grow exponentially

4 Data is increasingly everywhere and in many formats

5 Traditional solutions not designed for new requirements

Financial Pressure
6 Cost of data systems, as % of IT spend, continues to grow

7 Cost advantages of commodity hardware & open source

Page 4

The data refinery in an enterprise
Audio, Web, Mobile, CRM,
Video, ERP, SCM, …
Images
New Data Business
Transactions
Docs, Sources
Text, & Interactions
XML

HDFS
Web
Logs,
Clicks
Big Data
SQL NoSQL NewSQL
Social, Refinery
Graph, ETL
Feeds

EDW MPP NewSQL
Sensors,
Devices,
RFID

Business
Pig
Spatial, Intelligence
GPS Apache Hadoop
& Analytics
Events,
Other Dashboards, Reports,
Visualization, …

Page 5

Modernising Business Intelligence
• Before:
– Current records & short history
– Analytics/BI systems keep conformed / cleaned / digested data
– Unstructured data locked silos, archived offline
Inflexible, new questions require system redesigns

• Now
– Keep raw data in Hadoop for a long time
– Reprocess/enhance analytics/BI data on-demand
– Can directly experiment on all raw data
– New products / services can be added very quickly
Storage and agility justifies new infrastructure

Page 6

Refineries pull in raw data
Internal: pipelines with Apache Flume
– Web site logs
– Real-world events: retail, financial, vehicle movements
– New data sources you create
The data you couldn't afford to keep

External: pipelines and bulk deliveries
– Correlating data: weather, market, competition
– New sources -twitter feeds, infochimps, open government
– Real-world events: retail, financial
– Apache Sqoop
To help understand your own data

Page 8

Refineries refine raw data
• Clean up raw data
• Filter “cleaned” data

• Forward data to different destinations:
– Existing BI infrastructure
– New “Agile Data” infrastructures

• Offload work from the core Data Warehouse
– ETL operations
– Report and Chart Generation
– Ad-hoc queries

Needs: query, workflow and reporting tools
Page 9

Refineries can store data
• Retain historical transaction data, analyses
• Store (cleaned, filtered, compressed) raw data
• Provide the history for more advanced analysis in
future applications and queries

• Needs: storage, query tools
– Storage: HDFS and HBase
– Languages: Pig & Hive
– Workflow for scheduled jobs: Oozie
– Shared schema repository: HCatalog

Hadoop makes storing bulk & historical data affordable
Page 10

What if I didn't have a Data
Warehouse?

Page 12

Congratulations!

1. HBase: scale, Hadoop integration

2. mongoDB, CouchDB, Riak
good for web UIs

3. Postgres, MySQL, …
transactions
Page 13

Agile Data

Page 14

Agile Data
• SQL Experts: Hive HQL queries
• Ad-hoc queries: Pig
• Statistics platform: R + Hadoop
• Visualisation tools –including Excel
• New web UI applications

Because you don’t know all that you are looking for
when you collect the data

Page 15

Pig: an Agile Data language
• Optimised for refining data
• Dataflow-driven –much higher level than Java
• Macros and User Defined Functions
• ILLUSTRATE aids development
• For ad-hoc and production use

Page 17

Example: Packetpig
snort_alerts = LOAD '$pcap'
USING
com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts
GENERATE
com.packetloop.packetpig.udf.geoip.Country(src) as country,
priority;

countries = GROUP countries BY country;

countries = FOREACH countries
GENERATE
group,
AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');

Page 18

web UI: d3.js

Page 19

Analytics Apps: It takes a Team
• Broad skill-set to make useful apps
• Basically nobody has them all
• Application development is inherently collaborative

Page 20

Developers: learn statistics via Pig

Data Scientists & Statisticians:
learn Pig (and R)

Russ Jurney @ HUG UK in November
meetup.com/hadoop-users-group-uk/
Page 21

Challenge:
Becoming a data-driven organisation

Page 22

Challenges
• Thinking of the right questions to ask

• Conducting valid experiments:
A/B testing, surveys with effective sampling, …
– Not: "try a web new design for a week"
– Not: "please do a site survey" pop-up dialog

• Accepting negative results
– "no design was better than the other"

• Accepting results you don't agree with
– “trials imply the proposed strategy won't work”

Page 23

Example: Yahoo!
• Online Application logic driven by big lookup tables

• Lookup data computed periodically on Hadoop
– Machine learning, other expensive computation offline
– Personalization, classification, fraud, value analysis…

• Application development requires data science
– Huge amounts of actually observed data key to modern apps
– Hadoop used as the science platform

Architecting
© Hortonworks Inc. 2012 the Future of Big Data
Page 24

Yahoo! Homepage

• Serving Maps SCIENCE » Machine learning to build ever
• Users - Interests HADOOP better categorization models
CLUSTER
• Five Minute CATEGORIZATION
USER
Production BEHAVIOR MODELS (weekly)

• Weekly PRODUCTION
Categorization HADOOP » Identify user interests using
CLUSTER
models SERVING Categorization models
MAPS
(every 5 minutes)
USER
BEHAVIOR

SERVING SYSTEMS ENGAGED USERS

Build customised home pages with latest data (thousands / second)
Copyright Yahoo 2011 25

Conclusions

Hadoop can live alongside existing BI
systems –as a data refinery

• Store, refine bulk & unstructured data
• Archive data for long-term analysis
• Support ad-hoc queries over bulk data
• Become the data-science platform

26

Thank You!
Questions & Answers

hortonworks.com/download

Page 27

Hadoop as Data Refinery - Steve Loughran

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Viewers also liked

Viewers also liked (20)

Similar to Hadoop as Data Refinery - Steve Loughran

Similar to Hadoop as Data Refinery - Steve Loughran (20)

More from JAX London

More from JAX London (20)

Hadoop as Data Refinery - Steve Loughran

Editor's Notes