Dipping Your Toes into the Big Data Pool
Orlando CodeCamp 2014
John Ternent
VP Application Development
TravelClick
About Me
• 20+ years as a consultant, software engineer, architect, and tech executive.
• Mostly data-focused: RDBMS, object databases, and big data/NoSQL/analytics/data science.
• Presently leading development efforts for the TravelClick Channel Management team.
• Twitter : @jaternent
Poll : Big Data
• How many people are comfortable with the definition?
• How many people are “doing” Big Data?
Big Data in the Media
• The Three (now Four) V’s of Big Data:
  • Volume (Scale)
  • Variety (Forms)
  • Velocity (Streaming)
  • Veracity (Uncertainty)
• http://www.ibmbigdatahub.com/infographic/four-vs-big-data
A New Definition
• Big Data is about a tool set and approach that allows for non-linear scalability of solutions to data problems.
• “It depends on how capital your B and D are in Big Data…”
• What is Big Data to you?
The Big Data Ecosystem
Data
Sources
Data
Storage
Data
Manipulation
Data
Management
Data
Analysis
• Sqoop
• Flume
• HDFS
• HBase
• Pig
• MapReduce
• Zookeeper
• Avro
• Oozie
• Hive
• Mahout
• Impala
The Full Hadoop Ecosystem?
Great, but What IS Hadoop?
• Implementation of the Google MapReduce framework
• Distributed processing on commodity hardware
• Distributed file system with high failure tolerance
• Can support activity directly on top of the distributed file system (MapReduce jobs, Impala, Hive queries, etc.)
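
To make that concrete, here is the canonical word-count job as a short sketch in Pig Latin (the same tool used in the log-processing example later in this deck). Pig compiles these few lines into MapReduce jobs that run across the cluster; the input and output paths are placeholders, not paths from the original talk.

-- word count: the “hello world” of MapReduce; Pig compiles this to MapReduce jobs
lines = LOAD '/user/demo/input' USING TextLoader() AS (line:chararray);
-- split each line into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words together and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/user/demo/wordcount-output' USING PigStorage();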
Candidate Architecture
• Data Sources : log files, SQL DBs, text feeds, search (structured, unstructured, and semi-structured)
• HDFS : distributed storage across the cluster nodes
• Data Manipulation : MapReduce, Pig, Hive, Impala
• Analytic Products : search, R/SAS, Mahout, SQL Server, DW/DMart
Example : Log File Processing
xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] "POST /update-channels HTTP/1.1" 500 378 "-" "Zend_Http_Client" 53051 65921 617
- - - - [15/Jul/2013:04:03:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 544 94 590
xxx.16.23.133 - - [15/Jul/2013:04:04:00 -0400] "POST /update-channels HTTP/1.1" 200 104 "-" "Zend_Http_Client" 617786 4587 360
- - - [15/Jul/2013:04:04:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 568 94 590
- - - [15/Jul/2013:04:05:02 -0400] "GET /server-status?auto HTTP/1.1" 200 412 "-" "collectd/5.1.0" 560 94 591
xxx.16.23.70 - - [15/Jul/2013:04:05:09 -0400] "POST /fetch-channels HTTP/1.1" 200 3718 "-" "-" 452811 536 3975
xxx.16.23.70 - - [15/Jul/2013:04:05:10 -0400] "POST /fetch-channels HTTP/1.1" 200 6598 "-" "-" 333213 536 6855
xxx.16.23.70 - - [15/Jul/2013:04:05:11 -0400] "POST /fetch-channels HTTP/1.1" 200 5533 "-" "-" 282445 536 5790
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 8266 "-" "-" 462575 536 8542
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 42640 "-" "-" 1773203 536 42916
Example : Log File Processing
-- load raw Apache access logs, one line per record
A = LOAD '/Users/jternent/Documents/logs/api*' USING TextLoader AS (line:chararray);
-- parse each line into typed fields with a regex over the log format
B = FOREACH A GENERATE FLATTEN(
    (tuple(chararray, chararray, chararray, chararray, chararray, int, int, chararray, chararray, int, int, int))
    REGEX_EXTRACT_ALL(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\d+) (\\d+) (\\d+)'))
    AS (forwarded_ip:chararray, rem_log:chararray, rem_user:chararray, ts:chararray,
    req_url:chararray, result:int, resp_size:int, referrer:chararray, user_agent:chararray,
    svc_time:int, rec_bytes:int, resp_bytes:int);
-- drop lines the regex could not parse, then keep only fetch/update requests
B1 = FILTER B BY ts IS NOT NULL;
B2 = FILTER B1 BY req_url MATCHES '.*(fetch|update).*';
-- strip the method and protocol down to the bare request name
B3 = FOREACH B2 GENERATE *, REGEX_EXTRACT(req_url, '^\\w+ /(\\S+)[?]* \\S+', 1) AS req;
-- bucket each request by month/day/hour
C = FOREACH B3 GENERATE forwarded_ip,
    GetMonth(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS month,
    GetDay(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS day,
    GetHour(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS hour,
    req, result, svc_time;
-- aggregate service-time stats per hour, request, and status code
D = GROUP C BY (month, day, hour, req, result);
E = FOREACH D GENERATE FLATTEN(group), MAX(C.svc_time) AS max, MIN(C.svc_time) AS min, COUNT(C) AS count;
STORE E INTO '/Users/jternent/Documents/logs/ezy-logs-output' USING PigStorage();
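
One practical note: a script like this can be developed against a handful of local files with pig -x local and then pointed at HDFS paths and run on the cluster without changes, which is what the “same development for one file as 10,000 files” point on the pros-and-cons slide below is getting at.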
Another Real-World Example
• 2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,"eventTime":"Aug 10, 2013 4:03:50 AM","hotelId":8186,"channelId":9173,"submissionId":1376121011,"sessionId":null,"documentId":"9173SS8186_13761210111434582378cds.txt","queueName":"expedia-dx","roomCount":1,"submissionDayCount":1,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":2,"submissionStatusCode":0}
• 2013-08-10T04:03:53-04:00 INFO (6): {"eventType":2,"eventTime":"Aug 10, 2013 4:03:53 AM","hotelId":8525,"channelId":50091,"submissionId":1376116653,"sessionId":null,"documentId":"50091SS8525_13761166531434520293cds.txt","queueName":"expedia-dx","roomCount":5,"submissionDayCount":2,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":1,"submissionStatusCode":null}
• Roughly 100 million of these per week. About 25MB zipped per server per day (15 servers right now), 750MB uncompressed.
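
The same Pig pattern applies to these event logs. As a rough sketch (the paths here are placeholders, and the regex-based field extraction is an assumption for illustration; a proper JSON loader would be more robust), you could count submissions per hotel and status straight off the raw files:

-- sketch: extract two fields from the embedded JSON and aggregate
events = LOAD '/logs/submission-events*' USING TextLoader() AS (line:chararray);
parsed = FOREACH events GENERATE
    (int) REGEX_EXTRACT(line, '"hotelId":(\\d+)', 1) AS hotel_id,
    (int) REGEX_EXTRACT(line, '"submissionStatus":(\\d+)', 1) AS status;
-- drop lines where the fields could not be found
valid = FILTER parsed BY hotel_id IS NOT NULL;
by_hotel = GROUP valid BY (hotel_id, status);
summary = FOREACH by_hotel GENERATE FLATTEN(group) AS (hotel_id, status), COUNT(valid) AS events;
STORE summary INTO '/logs/events-by-hotel' USING PigStorage();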
Pig Example - Pros and Cons
• Pros:
  • Don’t need to ETL into a database; works straight off the file system
  • Same development for one file as for 10,000 files
  • Horizontally scalable
  • UDFs allow fine-grained control
  • Flexible
• Cons:
  • Language can be difficult to work with
  • MapReduce touches ALL the things to get the answer (compare to indexed search)
Unstructured and Semi-Structured Data
• Big Data tools can help with the analysis of data that would be more challenging in a relational database
  • Twitter feeds (Natural Language Processing)
  • Social network analysis
• Big Data approaches to search are making search tools more accessible and useful than ever
  • ElasticSearch
ElasticSearch/Kibana
(Diagram: JSON documents are pushed over the REST API into ElasticSearch, logs are fed in from the Hadoop FileSystem, and Kibana sits on top of ElasticSearch for dashboards and analytics.)
Analytics with Big Data
• Apache Mahout
  • Machine learning on Hadoop: recommendation, classification, clustering
• RHadoop
  • R MapReduce implementation on HDFS
• Tableau
  • Visualization on HDFS/Hive
Main point : you don’t have to roll your own for everything; many tools now use HDFS natively.
Return to SQL
• Many SQL dialects are being, or have been, ported to Hadoop
• Hive : create tables (DDL) on top of HDFS structures
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;

SELECT host, COUNT(*)
FROM apachelog
GROUP BY host;
Cloudera Impala
• Moves SQL processing onto each distributed node
• Written for performance
• Distribution and reduction of the query handled by the Impala engine
Big Data Tradeoffs
• Time tradeoff – loading/building/indexing vs. runtime
• ACID properties – different distribution models may compromise one or more of these properties
• Be aware of what tradeoffs you’re making
• TANSTAAFL – massive scalability, commodity hardware, but at what price?
• Tool sophistication
NoSQL – “Not Only SQL”
• Sacrificing ACID properties for different scalability benefits.
• Key/Value Store : SimpleDB, Riak, Redis
• Column Family Store : Cassandra, HBase
• Document Database : CouchDB, MongoDB
• Graph Database : Neo4J
• General properties
  • High horizontal scalability
  • Fast access
  • Simple data structures
  • Caching
Getting Started
• Play in the sandbox – Hadoop/Hive/Pig local mode or AWS
• Randy Zwitch has a great tutorial on this :
  • http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/
• Using airline data :
  • http://stat-computing.org/dataexpo/2009/the-data.html
• Kaggle competitions (data science)
• Lots of big data sets available; look for machine learning repositories
Getting Started
• Books for Developers
• Books for Managers
MOOCs
• Unprecedented access to very high-quality online courses, including:
  • Udacity : Data Science Track
    • Intro to Data Science
    • Data Wrangling with MongoDB
    • Intro to Hadoop and MapReduce
  • Coursera :
    • Machine Learning course
    • Data Science Certificate Track (R, Python)
  • Waikato University : Weka
Bonus Round : Data Science
Outro
• We live in exciting times!
• Confluence of data, processing power, and algorithmic sophistication.
• More data is available to make better decisions more easily than at any other time in human history.