Intro to Big Data - Orlando Code Camp 2014

Dipping Your Toes into the Big Data Pool
Orlando CodeCamp 2014
John Ternent
VP Application Development
TravelClick

About Me
• 20+ years as a consultant, software engineer, architect, and tech executive.
• Mostly data-focused: RDBMS, object databases, and big data/NoSQL/analytics/data science.
• Presently leading development efforts for the TravelClick Channel Management team.
• Twitter: @jaternent

Poll : Big Data
• How many people are comfortable with the definition?
• How many people are “doing” Big Data?

Big Data in the Media
• The Four (formerly Three) V’s of Big Data:
  • Volume (Scale)
  • Variety (Forms)
  • Velocity (Streaming)
  • Veracity (Uncertainty)
• http://www.ibmbigdatahub.com/infographic/four-vs-big-data

A New Definition
• Big Data is about a tool set and an approach that allow for non-linear scalability of solutions to data problems.
• “It depends on how capital your B and D are in Big Data…”
• What is Big Data to you?

The Big Data Ecosystem
Data Sources
• Sqoop
• Flume
Data Storage
• HDFS
• HBase
Data Manipulation
• Pig
• MapReduce
Data Management
• Zookeeper
• Avro
• Oozie
Data Analysis
• Hive
• Mahout
• Impala

Great, but What IS Hadoop?
• Open-source implementation of Google’s MapReduce framework
• Distributed processing on commodity hardware
• Distributed file system with high failure tolerance
• Can support activity directly on top of the distributed file system (MapReduce jobs, Impala, Hive queries, etc.)
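
A minimal single-process sketch of the map / shuffle / reduce pattern that Hadoop spreads across many nodes. This is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a real job would read its input from HDFS rather than from a Python list.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big pool", "data pool"]))
```

Because each map call looks at one line and each reduce call looks at one key, both phases can in principle run in parallel on different machines, which is the property Hadoop exploits.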

Candidate Architecture
Data Sources
• Log files
• SQL DBs
• Text feeds
• Search
• Structured
• Unstructured
• Semi-structured
HDFS (shown three times in the diagram)
Data Manipulation
• MapReduce
• Pig
• Hive
• Impala
Analytic Products
• Search
• R/SAS
• Mahout
• SQL Server
• DW/DMart

Example : Log File Processing
xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] "POST /update-channels HTTP/1.1" 500 378 "-" "Zend_Http_Client" 53051 65921 617
- - - - [15/Jul/2013:04:03:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 544 94 590
xxx.16.23.133 - - [15/Jul/2013:04:04:00 -0400] "POST /update-channels HTTP/1.1" 200 104 "-" "Zend_Http_Client" 617786 4587 360
- - - [15/Jul/2013:04:04:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 568 94 590
- - - [15/Jul/2013:04:05:02 -0400] "GET /server-status?auto HTTP/1.1" 200 412 "-" "collectd/5.1.0" 560 94 591
xxx.16.23.70 - - [15/Jul/2013:04:05:09 -0400] "POST /fetch-channels HTTP/1.1" 200 3718 "-" "-" 452811 536 3975

Example : Log File Processing
-- Load each raw log line as a single chararray field.
A = LOAD '/Users/jternent/Documents/logs/api*' USING TextLoader as (line:chararray);
-- Split each line into typed fields with one regular expression.
B = FOREACH A GENERATE FLATTEN(
    (tuple(chararray, chararray, chararray, chararray, chararray, int, int, chararray, chararray, int, int, int))
    REGEX_EXTRACT_ALL(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\d+) (\\d+) (\\d+)'))
    as (forwarded_ip:chararray, rem_log:chararray, rem_user:chararray, ts:chararray,
        req_url:chararray, result:int, resp_size:int, referrer:chararray, user_agent:chararray,
        svc_time:int, rec_bytes:int, resp_bytes:int);
-- Drop lines the regular expression could not parse.
B1 = FILTER B BY ts IS NOT NULL;
-- Keep only fetch/update requests; filter B1 (not B) so the null check above takes effect.
B2 = FILTER B1 BY req_url MATCHES '.*(fetch|update).*';
B3 = FOREACH B2 GENERATE *, REGEX_EXTRACT(req_url, '^\\w+ /(\\S+)[?]* \\S+', 1) as req;
C = FOREACH B3 GENERATE forwarded_ip,
    GetMonth(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) as month,
    GetDay(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) as day,
    GetHour(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) as hour,
    req, result, svc_time;
-- Aggregate service time per hour, request, and result code.
D = GROUP C BY (month, day, hour, req, result);
E = FOREACH D GENERATE flatten(group), MAX(C.svc_time) as max, MIN(C.svc_time) as min,
    COUNT(C) as count;
STORE E INTO '/Users/jternent/Documents/logs/ezy-logs-output' USING PigStorage();
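
For readers who do not know Pig, the same load / parse / filter / group / aggregate flow can be sketched in plain Python on a single machine. This is an illustration only: the function name summarize is hypothetical, the regular expressions assume the backslash escapes of the usual Apache log pattern, and the timestamp parsing assumes an English locale for month names.

```python
import re
from collections import defaultdict
from datetime import datetime

# Twelve capture groups, matching the twelve fields named in the Pig script.
LOG_RE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) '
    r'"([^"]*)" "([^"]*)" (\d+) (\d+) (\d+)'
)
# Pulls the path (without the leading slash) out of "METHOD /path PROTOCOL".
REQ_RE = re.compile(r'^\w+ /(\S+)[?]* \S+')


def summarize(lines):
    """Return {(month, day, hour, req, result): (max, min, count)} of svc_time."""
    groups = defaultdict(list)
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # unparseable line, like FILTER ... ts IS NOT NULL
        ts, req_url = match.group(4), match.group(5)
        result, svc_time = int(match.group(6)), int(match.group(10))
        if not re.search(r'fetch|update', req_url):
            continue  # keep only fetch/update requests
        req_match = REQ_RE.match(req_url)
        req = req_match.group(1) if req_match else None
        when = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z")
        groups[(when.month, when.day, when.hour, req, result)].append(svc_time)
    return {key: (max(v), min(v), len(v)) for key, v in groups.items()}
```

The Python version has to fit in one process; the point of the Pig script is that the same logic runs unchanged when the input grows to thousands of files.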

Another Real-World Example
• 2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,"eventTime":"Aug 10, 2013 4:03:50 AM","hotelId":8186,"channelId":9173,"submissionId":1376121011,"sessionId":null,"documentId":"9173SS8186_13761210111434582378cds.txt","queueName":"expedia-dx","roomCount":1,"submissionDayCount":1,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":2,"submissionStatusCode":0}
• 2013-08-10T04:03:53-04:00 INFO (6): {"eventType":2,"eventTime":"Aug 10, 2013 4:03:53 AM","hotelId":8525,"channelId":50091,"submissionId":1376116653,"sessionId":null,"documentId":"50091SS8525_13761166531434520293cds.txt","queueName":"expedia-dx","roomCount":5,"submissionDayCount":2,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":1,"submissionStatusCode":null}
• 100 million (ish) of these per week. 25 MB zipped per server per day (15 servers right now), 750 MB uncompressed.
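
Each of these lines is a timestamp, a log level, a number in parentheses, and a JSON payload. A rough sketch of turning them into records is below; the names parse_event, count_by, logTime, and logLevel are assumptions for illustration, and the meaning of the parenthesized number is not stated in the slides, so it is matched but not interpreted.

```python
import json
import re
from collections import Counter
from datetime import datetime

# timestamp, level, "(n):", then the JSON object through the end of the line
EVENT_RE = re.compile(r'^(\S+) (\w+) \((\d+)\): (\{.*\})$')


def parse_event(line):
    """Return the JSON payload as a dict with log metadata added, or None."""
    match = EVENT_RE.match(line.strip())
    if match is None:
        return None
    event = json.loads(match.group(4))
    event["logTime"] = datetime.fromisoformat(match.group(1))
    event["logLevel"] = match.group(2)
    return event


def count_by(events, field):
    """Count parsed events by the value of one payload field."""
    return Counter(e[field] for e in events if e is not None)
```

Because the payload is already JSON, no schema has to be declared up front; new fields simply appear as new keys in the dict.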

Pig Example - Pros and Cons
• Pros:
  • No need to ETL into a database; everything runs off the file system
  • Same development effort for one file as for 10,000 files
  • Horizontally scalable
  • UDFs allow fine-grained control
  • Flexible
• Cons:
  • Language can be difficult to work with
  • MapReduce touches ALL the things to get the answer (compare to indexed search)
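
The last point, a full scan versus an indexed search, can be illustrated with a toy example. This is a sketch under simplifying assumptions (whitespace tokenization, everything in memory); the function names scan, build_index, and lookup are hypothetical.

```python
from collections import defaultdict


def scan(docs, term):
    """Full scan: inspect every document on every query."""
    return [doc_id for doc_id, text in docs.items()
            if term in text.lower().split()]


def build_index(docs):
    """Pay an up-front cost once: map each term to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, term):
    """Indexed search: one dictionary lookup, no pass over the documents."""
    return sorted(index.get(term, set()))
```

The scan needs no preparation but reads everything each time; the index answers quickly but has to be built and kept current.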

Unstructured and Semi-Structured Data
• Big Data tools can help with the analysis of data that would be more challenging in a relational database
  • Twitter feeds (Natural Language Processing)
  • Social network analysis
• Big Data approaches to search are making search tools more accessible and useful than ever
  • ElasticSearch

ElasticSearch/Kibana
[Diagram components: JSON Documents, REST, ElasticSearch, Logs, Hadoop FileSystem, Kibana]
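
Since ElasticSearch accepts JSON documents over REST, preparing a batch of events for it is mostly string formatting. The sketch below builds a newline-delimited body in the shape used by the bulk endpoint (an action line followed by the document). It only builds the text and does not send it; the index name "submissions" and the function name bulk_body are assumptions, and older ElasticSearch versions also expected a type in the action line.

```python
import json


def bulk_body(index_name, docs):
    """Build a newline-delimited JSON body: one action line, then one document line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action
        lines.append(json.dumps(doc))                                # document
    return "\n".join(lines) + "\n"  # the body must end with a newline


if __name__ == "__main__":
    print(bulk_body("submissions", [{"hotelId": 8186, "queueName": "expedia-dx"}]))
```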

Analytics with Big Data
• Apache Mahout
  • Machine learning on Hadoop
  • Recommendation
  • Classification
  • Clustering
• RHadoop
  • R MapReduce implementation on HDFS
• Tableau
  • Visualization on HDFS/Hive
Main point: You don’t have to roll your own for everything; many tools now use HDFS natively.

Return to SQL
• Many SQL dialects are being, or have been, ported to Hadoop
• Hive: create DDL tables on top of HDFS structures
-- Each capture group in input.regex maps, in order, to one column.
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
-- Query the raw log files with ordinary SQL.
SELECT host, COUNT(*)
FROM apachelog
GROUP BY host;
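
To see what the regex-based table definition does, the sketch below applies the same idea in Python: one capture group per column, then a count per host. It assumes the pattern is the usual Apache access-log regex with its spaces and bracket escapes intact; the names to_row and count_by_host are hypothetical, and skipping non-matching lines is a simplification chosen for this sketch.

```python
import re
from collections import Counter

COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]

# Nine capture groups, one per column; the last two (referer, agent) are optional.
APACHE_RE = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?'
)


def to_row(line):
    """Map one log line to a {column: value} dict, or None if it does not match."""
    match = APACHE_RE.fullmatch(line)
    return dict(zip(COLUMNS, match.groups())) if match else None


def count_by_host(lines):
    """Equivalent in spirit to SELECT host, COUNT(*) ... GROUP BY host."""
    rows = (to_row(line) for line in lines)
    return Counter(row["host"] for row in rows if row is not None)
```

All columns come back as strings, which is why the table declares every column as STRING.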

Cloudera Impala
• Moves SQL processing onto each distributed node
• Written for performance
• Distribution and reduction of the query handled by the Impala engine

Big Data Tradeoffs
• Time tradeoff – loading/building/indexing vs. runtime
• ACID properties – different distribution models may compromise one or more of these properties
• Be aware of what tradeoffs you’re making
• TANSTAAFL – massive scalability, commodity hardware, but at what price?
• Tool sophistication

NoSQL – “Not Only SQL”
• Sacrificing ACID properties for different scalability benefits.
• Key/Value Store: SimpleDB, Riak, Redis
• Column Family Store: Cassandra, HBase
• Document Database: CouchDB, MongoDB
• Graph Database: Neo4J
• General properties
  • High horizontal scalability
  • Fast access
  • Simple data structures
  • Caching
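
A rough way to picture the four families is to write the same record in each shape using plain Python structures. This is only a mental model, not the API of any product listed above; the keys, the column family names ("ids", "stats"), and the edge label SUBMITS_TO are invented for illustration.

```python
import json

# Key/value store: an opaque value fetched by a single key.
key_value = {
    "submission:1376121011": json.dumps({"hotelId": 8186, "channelId": 9173}),
}

# Document database: the value is a structured document whose fields can be queried.
document = {"_id": 1376121011, "hotelId": 8186, "channelId": 9173, "roomCount": 1}

# Column family store: row key -> column family -> column -> value.
column_family = {
    "1376121011": {
        "ids": {"hotelId": 8186, "channelId": 9173},
        "stats": {"roomCount": 1},
    },
}

# Graph database: nodes plus typed edges between them.
graph = {
    "nodes": {"hotel:8186", "channel:9173"},
    "edges": [("hotel:8186", "SUBMITS_TO", "channel:9173")],
}
```

In each case the structure is simple and lookups go through a key, which is part of what makes these stores easy to spread across machines.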

Getting Started
• Play in the sandbox – Hadoop/Hive/Pig local mode or AWS
• Randy Zwitch has a great tutorial on this:
  • http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/
• Using Airline data:
  • http://stat-computing.org/dataexpo/2009/the-data.html
• Kaggle competitions (data science)
• Lots of big data sets available; look for machine learning repositories

Getting Started
• Books for Developers
• Books for Managers

MOOCs
• Unprecedented access to very high-quality online courses, including
• Udacity: Data Science Track
  • Intro to Data Science
  • Data Wrangling with MongoDB
  • Intro to Hadoop and MapReduce
• Coursera:
  • Machine Learning course
  • Data Science Certificate Track (R, Python)
• Waikato University: Weka

Outro
• We live in exciting times!
• Confluence of data, processing power, and algorithmic sophistication.
• More data is available to make better decisions more easily than at any other time in human history.
