Dipping Your Toes into the
Big Data Pool
Orlando CodeCamp 2014
John Ternent
VP Application Development
TravelClick
About Me
• 20+ years as a consultant, software engineer, architect, and tech executive.
• Mostly data-focused: RDBMS, object databases, and big data/NoSQL/analytics/data science.
• Presently leading development efforts for the TravelClick Channel Management team.
• Twitter: @jaternent
Poll: Big Data
• How many people are comfortable with the definition?
• How many people are “doing” Big Data?
Big Data in the Media
• The Four V’s of Big Data (originally three):
  • Volume (Scale)
  • Variety (Forms)
  • Velocity (Streaming)
  • Veracity (Uncertainty)
• http://www.ibmbigdatahub.com/infographic/four-vs-big-data
A New Definition
• Big Data is about a tool set and approach that allows for non-linear scalability of solutions to data problems.
• “It depends on how capital your B and D are in Big Data…”
• What is Big Data to you?
The Big Data Ecosystem
• Data Sources: Sqoop, Flume
• Data Storage: HDFS, HBase
• Data Manipulation: Pig, MapReduce
• Data Management: Zookeeper, Avro, Oozie
• Data Analysis: Hive, Mahout, Impala
The Full Hadoop Ecosystem?
Great, but What IS Hadoop?
• Implementation of the Google MapReduce framework
• Distributed processing on commodity hardware
• Distributed file system with high failure tolerance
• Can support activity directly on top of the distributed file system (MapReduce jobs, Impala, Hive queries, etc.)
Candidate Architecture
• Data Sources: log files, SQL DBs, text feeds, search; structured, unstructured, and semi-structured
• Storage: HDFS
• Data Manipulation: MapReduce, Pig, Hive, Impala
• Analytic Products: search, R/SAS, Mahout, SQL Server, DW/DMart
Example : Log File Processing
xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] "POST /update-channels HTTP/1.1" 500 378 "-" "Zend_Http_Client" 53051 65921 617
- - - - [15/Jul/2013:04:03:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 544 94 590
xxx.16.23.133 - - [15/Jul/2013:04:04:00 -0400] "POST /update-channels HTTP/1.1" 200 104 "-" "Zend_Http_Client" 617786 4587 360
- - - [15/Jul/2013:04:04:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 568 94 590
- - - [15/Jul/2013:04:05:02 -0400] "GET /server-status?auto HTTP/1.1" 200 412 "-" "collectd/5.1.0" 560 94 591
xxx.16.23.70 - - [15/Jul/2013:04:05:09 -0400] "POST /fetch-channels HTTP/1.1" 200 3718 "-" "-" 452811 536 3975
xxx.16.23.70 - - [15/Jul/2013:04:05:10 -0400] "POST /fetch-channels HTTP/1.1" 200 6598 "-" "-" 333213 536 6855
xxx.16.23.70 - - [15/Jul/2013:04:05:11 -0400] "POST /fetch-channels HTTP/1.1" 200 5533 "-" "-" 282445 536 5790
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 8266 "-" "-" 462575 536 8542
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 42640 "-" "-" 1773203 536 42916
Example : Log File Processing
A = LOAD '/Users/jternent/Documents/logs/api*' USING TextLoader AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(
        (tuple(chararray, chararray, chararray, chararray, chararray, int, int,
               chararray, chararray, int, int, int))
        REGEX_EXTRACT_ALL(line,
            '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\d+) (\\d+) (\\d+)'))
    AS (forwarded_ip:chararray, rem_log:chararray, rem_user:chararray, ts:chararray,
        req_url:chararray, result:int, resp_size:int, referrer:chararray,
        user_agent:chararray, svc_time:int, rec_bytes:int, resp_bytes:int);
B1 = FILTER B BY ts IS NOT NULL;
B2 = FILTER B BY req_url MATCHES '.*(fetch|update).*';
B3 = FOREACH B2 GENERATE *, REGEX_EXTRACT(req_url, '^\\w+ /(\\S+)[?]* \\S+', 1) AS req;
C = FOREACH B3 GENERATE forwarded_ip,
        GetMonth(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS month,
        GetDay(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS day,
        GetHour(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS hour,
        req, result, svc_time;
D = GROUP C BY (month, day, hour, req, result);
E = FOREACH D GENERATE FLATTEN(group), MAX(C.svc_time) AS max, MIN(C.svc_time) AS min, COUNT(C) AS count;
STORE E INTO '/Users/jternent/Documents/logs/ezy-logs-output' USING PigStorage();
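For readers who want to sanity-check the extraction step outside of Pig, here is a rough Python equivalent of the same regex against one of the sample log lines. Field names mirror the aliases in the Pig script; lines that do not match the pattern simply return None:

```python
import re

# Same field layout the Pig script extracts; the last three numbers are
# service time, received bytes, and response bytes, per the script's aliases.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(.+?)" '
    r'(\S+) (\S+) "([^"]*)" "([^"]*)" (\d+) (\d+) (\d+)'
)

def parse_line(line):
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # non-matching lines are skipped, like the Pig FILTER
    (ip, rem_log, rem_user, ts, request, status, size,
     referrer, agent, svc_time, rec_bytes, resp_bytes) = m.groups()
    return {"ip": ip, "ts": ts, "request": request,
            "status": int(status), "svc_time": int(svc_time)}

sample = ('xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] '
          '"POST /update-channels HTTP/1.1" 500 378 "-" '
          '"Zend_Http_Client" 53051 65921 617')
rec = parse_line(sample)
```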
Another Real-World Example
• 2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,"eventTime":"Aug 10, 2013 4:03:50 AM","hotelId":8186,"channelId":9173,"submissionId":1376121011,"sessionId":null,"documentId":"9173SS8186_13761210111434582378cds.txt","queueName":"expedia-dx","roomCount":1,"submissionDayCount":1,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":2,"submissionStatusCode":0}
• 2013-08-10T04:03:53-04:00 INFO (6): {"eventType":2,"eventTime":"Aug 10, 2013 4:03:53 AM","hotelId":8525,"channelId":50091,"submissionId":1376116653,"sessionId":null,"documentId":"50091SS8525_13761166531434520293cds.txt","queueName":"expedia-dx","roomCount":5,"submissionDayCount":2,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":1,"submissionStatusCode":null}
• Roughly 100 million of these per week. 25 MB zipped per server per day (15 servers right now), 750 MB uncompressed.
Pig Example - Pros and Cons
• Pros:
  • Don’t need to ETL into a database; everything runs off the file system
  • Same development effort for one file as for 10,000 files
  • Horizontally scalable
  • UDFs allow fine-grained control
  • Flexible
• Cons:
  • Language can be difficult to work with
  • MapReduce touches ALL the things to get the answer (compare to indexed search)
Unstructured and Semi-Structured Data
• Big Data tools can help with the analysis of data that would be more challenging in a relational database
  • Twitter feeds (Natural Language Processing)
  • Social network analysis
• Big Data approaches to search are making search tools more accessible and useful than ever
  • ElasticSearch
ElasticSearch/Kibana
• JSON documents are pushed into ElasticSearch over its REST API
• Logs can also be fed in from the Hadoop FileSystem
• Kibana sits on top of ElasticSearch for analytics and visualization
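To make "JSON documents over REST" concrete, here is a sketch of ElasticSearch's bulk-index payload format (NDJSON: an action line followed by the document, one pair per item, trailing newline required). The index name "logs" and type name "event" are made-up examples, and "_type" reflects the ElasticSearch API of this era:

```python
import json

# Build an ElasticSearch _bulk payload: for each document, an action
# metadata line and then the document itself, newline-delimited.
# Index/type names here are illustrative only.
def bulk_payload(docs, index="logs", doc_type="event"):
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

payload = bulk_payload([{"eventType": 3, "hotelId": 8186}])
# POST the resulting string to http://localhost:9200/_bulk with any REST client
```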
Analytics with Big Data
• Apache Mahout
  • Machine learning on Hadoop
  • Recommendation
  • Classification
  • Clustering
• RHadoop
  • R MapReduce implementation on HDFS
• Tableau
  • Visualization on HDFS/Hive
Main point: you don’t have to roll your own for everything; many tools now use HDFS natively.
Return to SQL
• Many SQL dialects are being/have been ported to Hadoop
• Hive: create DDL tables on top of HDFS structures

CREATE TABLE apachelog (
  host STRING, identity STRING, user STRING, time STRING,
  request STRING, status STRING, size STRING,
  referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;

SELECT host, COUNT(*) FROM apachelog GROUP BY host;
Cloudera Impala
• Moves SQL processing onto each distributed node
• Written for performance
• Distribution and reduction of the query is handled by the Impala engine
Big Data Tradeoffs
• Time tradeoff – loading/building/indexing vs. runtime
• ACID properties – different distribution models may compromise one or more of these properties
• Be aware of what tradeoffs you’re making
• TANSTAAFL – massive scalability and commodity hardware, but at what price?
• Tool sophistication
NoSQL – “Not Only SQL”
• Sacrificing ACID properties for different scalability benefits
• Key/Value Store: SimpleDB, Riak, Redis
• Column Family Store: Cassandra, HBase
• Document Database: CouchDB, MongoDB
• Graph Database: Neo4J
• General properties:
  • High horizontal scalability
  • Fast access
  • Simple data structures
  • Caching
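The "high horizontal scalability" of key/value stores comes largely from hash-based partitioning of keys across nodes. A toy sketch of the idea (node names are made up; real stores such as Riak use consistent hashing with a ring of virtual nodes so that adding a node moves only a fraction of the keys):

```python
import hashlib

# Hypothetical three-node cluster
NODES = ["node-a", "node-b", "node-c"]

def node_for(key, nodes=NODES):
    # Hash the key and pick a node by modulus; deterministic, so any
    # client can route reads and writes without a central coordinator.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

placement = {k: node_for(k) for k in ["hotel:8186", "hotel:8525", "channel:9173"]}
```

Because each key maps to exactly one node, capacity grows roughly linearly with node count, which is the scalability the slide is trading ACID guarantees for.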
Getting Started
• Play in the sandbox – Hadoop/Hive/Pig local mode or AWS
• Randy Zwitch has a great tutorial on this:
  • http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/
• Using airline data:
  • http://stat-computing.org/dataexpo/2009/the-data.html
• Kaggle competitions (data science)
• Lots of big data sets available; look for machine learning repositories
Getting Started
• Books for Developers
• Books for Managers
MOOCs
• Unprecedented access to very high-quality online courses, including:
  • Udacity: Data Science Track
    • Intro to Data Science
    • Data Wrangling with MongoDB
    • Intro to Hadoop and MapReduce
  • Coursera:
    • Machine Learning course
    • Data Science Certificate Track (R, Python)
  • Waikato University: Weka
Bonus Round : Data Science
Outro
• We live in exciting times!
• Confluence of data, processing power, and algorithmic sophistication.
• More data is available to make better decisions more easily than at any other time in human history.
Intro to Big Data - Orlando Code Camp 2014

Very high-level introduction to Big Data technologies, with an emphasis on how folks can get started easily.

Published in: Technology