Intro to Big Data - Orlando Code Camp 2014
Very high-level introduction to Big Data technologies, with an emphasis on how folks can get started easily.

Presentation Transcript

    • Dipping Your Toes into the Big Data Pool
      Orlando CodeCamp 2014
      John Ternent, VP Application Development, TravelClick
    • About Me
      ◦ 20+ years as a consultant, software engineer, architect, and tech executive.
      ◦ Mostly data-focused: RDBMS, object databases, and big data/NoSQL/analytics/data science.
      ◦ Presently leading development efforts for the TravelClick Channel Management team.
      ◦ Twitter: @jaternent
    • Poll: Big Data
      ◦ How many people are comfortable with the definition?
      ◦ How many people are “doing” Big Data?
    • Big Data in the Media
      ◦ The Four V’s of Big Data:
        - Volume (Scale)
        - Variety (Forms)
        - Velocity (Streaming)
        - Veracity (Uncertainty)
      ◦ http://www.ibmbigdatahub.com/infographic/four-vs-big-data
    • A New Definition
      ◦ Big Data is about a tool set and approach that allows for non-linear scalability of solutions to data problems.
      ◦ “It depends on how capital your B and D are in Big Data…”
      ◦ What is Big Data to you?
    • The Big Data Ecosystem
      ◦ Data Sources: Sqoop, Flume
      ◦ Data Storage: HDFS, HBase
      ◦ Data Manipulation: Pig, MapReduce
      ◦ Data Management: Zookeeper, Avro, Oozie
      ◦ Data Analysis: Hive, Mahout, Impala
    • The Full Hadoop Ecosystem?
    • Great, but What IS Hadoop?
      ◦ Implementation of the Google MapReduce framework
      ◦ Distributed processing on commodity hardware
      ◦ Distributed file system with high failure tolerance
      ◦ Can support activity directly on top of the distributed file system (MapReduce jobs, Impala, Hive queries, etc.); a minimal word-count sketch follows below
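      To make the programming model concrete, here is a minimal sketch of the canonical MapReduce word count written as a Hadoop Streaming job in Python. The slide contains no code, so the file names and the invocation are illustrative assumptions, not material from the talk.

      #!/usr/bin/env python
      # mapper.py - emit (word, 1) for every word read from stdin.
      import sys

      for line in sys.stdin:
          for word in line.strip().split():
              print("%s\t%d" % (word, 1))

      #!/usr/bin/env python
      # reducer.py - Hadoop sorts mapper output by key, so all counts for a
      # given word arrive consecutively; sum them as they stream past.
      import sys

      current_word, total = None, 0
      for line in sys.stdin:
          word, count = line.rstrip("\n").split("\t", 1)
          if word != current_word:
              if current_word is not None:
                  print("%s\t%d" % (current_word, total))
              current_word, total = word, 0
          total += int(count)
      if current_word is not None:
          print("%s\t%d" % (current_word, total))

      Submitted to a cluster via the streaming jar (hadoop jar hadoop-streaming.jar -input /logs/in -output /logs/counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py), or smoke-tested with no cluster at all: cat input.txt | python mapper.py | sort | python reducer.py.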
    • Candidate Architecture
      ◦ Data Sources: log files, SQL DBs, text feeds, search; structured, unstructured, and semi-structured
      ◦ Storage: HDFS
      ◦ Data Manipulation: MapReduce, Pig, Hive, Impala
      ◦ Analytic Products: search, R/SAS, Mahout, SQL Server, DW/DMart
    • Example: Log File Processing
      xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] "POST /update-channels HTTP/1.1" 500 378 "-" "Zend_Http_Client" 53051 65921 617
      - - - - [15/Jul/2013:04:03:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 544 94 590
      xxx.16.23.133 - - [15/Jul/2013:04:04:00 -0400] "POST /update-channels HTTP/1.1" 200 104 "-" "Zend_Http_Client" 617786 4587 360
      - - - [15/Jul/2013:04:04:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 568 94 590
      - - - [15/Jul/2013:04:05:02 -0400] "GET /server-status?auto HTTP/1.1" 200 412 "-" "collectd/5.1.0" 560 94 591
      xxx.16.23.70 - - [15/Jul/2013:04:05:09 -0400] "POST /fetch-channels HTTP/1.1" 200 3718 "-" "-" 452811 536 3975
      xxx.16.23.70 - - [15/Jul/2013:04:05:10 -0400] "POST /fetch-channels HTTP/1.1" 200 6598 "-" "-" 333213 536 6855
      xxx.16.23.70 - - [15/Jul/2013:04:05:11 -0400] "POST /fetch-channels HTTP/1.1" 200 5533 "-" "-" 282445 536 5790
      xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 8266 "-" "-" 462575 536 8542
      xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 42640 "-" "-" 1773203 536 42916
    • Example: Log File Processing
      -- Load raw log lines and split out the fields with a regex.
      A = LOAD '/Users/jternent/Documents/logs/api*' USING TextLoader AS (line:chararray);
      B = FOREACH A GENERATE FLATTEN(
            (tuple(chararray, chararray, chararray, chararray, chararray, int, int,
                   chararray, chararray, int, int, int))
            REGEX_EXTRACT_ALL(line,
              '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\d+) (\\d+) (\\d+)'))
          AS (forwarded_ip:chararray, rem_log:chararray, rem_user:chararray, ts:chararray,
              req_url:chararray, result:int, resp_size:int, referrer:chararray,
              user_agent:chararray, svc_time:int, rec_bytes:int, resp_bytes:int);
      -- Drop unparseable lines, then keep only fetch/update requests.
      B1 = FILTER B BY ts IS NOT NULL;
      B2 = FILTER B1 BY req_url MATCHES '.*(fetch|update).*';
      B3 = FOREACH B2 GENERATE *, REGEX_EXTRACT(req_url, '^\\w+ /(\\S+)[?]* \\S+', 1) AS req;
      -- Bucket by hour and summarize service times per request/result.
      C = FOREACH B3 GENERATE forwarded_ip,
            GetMonth(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS month,
            GetDay(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS day,
            GetHour(ToDate(ts, 'd/MMM/yyyy:HH:mm:ss Z')) AS hour,
            req, result, svc_time;
      D = GROUP C BY (month, day, hour, req, result);
      E = FOREACH D GENERATE FLATTEN(group), MAX(C.svc_time) AS max, MIN(C.svc_time) AS min, COUNT(C) AS count;
      STORE E INTO '/Users/jternent/Documents/logs/ezy-logs-output' USING PigStorage();
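      A script like this can be exercised on a laptop before it ever touches a cluster: Pig's local mode (pig -x local logreport.pig, where logreport.pig is a hypothetical name for the script above) runs the identical statements against the local file system, and the same code then runs unchanged against HDFS on a cluster.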
    • Another Real-World Example
      ◦ 2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,"eventTime":"Aug 10, 2013 4:03:50 AM","hotelId":8186,"channelId":9173,"submissionId":1376121011,"sessionId":null,"documentId":"9173SS8186_13761210111434582378cds.txt","queueName":"expedia-dx","roomCount":1,"submissionDayCount":1,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":2,"submissionStatusCode":0}
      ◦ 2013-08-10T04:03:53-04:00 INFO (6): {"eventType":2,"eventTime":"Aug 10, 2013 4:03:53 AM","hotelId":8525,"channelId":50091,"submissionId":1376116653,"sessionId":null,"documentId":"50091SS8525_13761166531434520293cds.txt","queueName":"expedia-dx","roomCount":5,"submissionDayCount":2,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":1,"submissionStatusCode":null}
      ◦ Roughly 100 million of these per week; 25 MB zipped per server per day (15 servers right now), 750 MB uncompressed. A short parsing sketch follows below.
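      Events like these are easy to poke at with a short script before committing to a full pipeline. A minimal sketch, assuming Python, the field names visible in the samples above, and a hypothetical local file name (neither the script nor the file name comes from the talk):

      import json
      import re
      from collections import Counter

      # Each line looks like "<timestamp> INFO (6): {json payload}";
      # the prefix format is inferred from the two sample lines above.
      LINE_RE = re.compile(r'^(\S+) INFO \(\d+\): (\{.*\})$')

      status_counts = Counter()
      with open("submission-events.log") as f:  # hypothetical file name
          for line in f:
              m = LINE_RE.match(line.strip())
              if not m:
                  continue
              event = json.loads(m.group(2))
              status_counts[(event["channelId"], event["submissionStatus"])] += 1

      # Ten busiest (channel, status) pairs.
      for (channel, status), n in status_counts.most_common(10):
          print(channel, status, n)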
    • Pig Example - Pros and Cons
      ◦ Pros:
        - Don't need to ETL into a database; works straight off the file system
        - Same development effort for one file as for 10,000 files
        - Horizontally scalable
        - UDFs allow fine-grained control
        - Flexible
      ◦ Cons:
        - Language can be difficult to work with
        - MapReduce touches ALL the things to get the answer (compare to indexed search)
    • Unstructured and Semi-Structured Data
      ◦ Big Data tools can help with the analysis of data that would be more challenging to handle in a relational database
        - Twitter feeds (Natural Language Processing)
        - Social network analysis
      ◦ Big Data approaches to search are making search tools more accessible and useful than ever
        - ElasticSearch (see the sketch after the next slide)
    • ElasticSearch/Kibana
      ◦ Architecture sketch from the slide: logs land in the Hadoop file system, flow into ElasticSearch as JSON documents over its REST interface, and Kibana provides the dashboard on top.
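      As a taste of how approachable this is, a minimal sketch of talking to ElasticSearch over plain REST from Python. It assumes a node on localhost:9200, the Python requests library, and the 1.x-era index/type URL layout current when this talk was given; the index and document are made up for illustration.

      import json
      import requests  # plain HTTP; no ElasticSearch client library required

      ES = "http://localhost:9200"  # assumed local node
      HEADERS = {"Content-Type": "application/json"}

      # Index a tweet-like JSON document (ElasticSearch assigns the id).
      doc = {"user": "jaternent", "text": "Dipping my toes into the Big Data pool"}
      requests.post(ES + "/tweets/tweet", data=json.dumps(doc), headers=HEADERS)

      # Full-text search across everything indexed so far.
      query = {"query": {"match": {"text": "big data"}}}
      resp = requests.post(ES + "/tweets/_search", data=json.dumps(query), headers=HEADERS)
      for hit in resp.json()["hits"]["hits"]:
          print(hit["_source"]["text"])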
    • Analytics with Big Data
      ◦ Apache Mahout: machine learning on Hadoop
        - Recommendation
        - Classification
        - Clustering
      ◦ RHadoop: R MapReduce implementation on HDFS
      ◦ Tableau: visualization on HDFS/Hive
      ◦ Main point: you don't have to roll your own for everything; many tools now use HDFS natively
    • Return to SQL
      ◦ Many SQL dialects are being, or have been, ported to Hadoop
      ◦ Hive: create DDL tables on top of HDFS structures

      CREATE TABLE apachelog (
        host STRING,
        identity STRING,
        user STRING,
        time STRING,
        request STRING,
        status STRING,
        size STRING,
        referer STRING,
        agent STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
      )
      STORED AS TEXTFILE;

      SELECT host, COUNT(*) FROM apachelog GROUP BY host;
    • Cloudera Impala
      ◦ Moves SQL processing onto each distributed node
      ◦ Written for performance
      ◦ Distribution and reduction of the query are handled by the Impala engine
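      Because Impala shares the Hive metastore, a query like the apachelog SELECT on the previous slide can (file format permitting; Impala does not support every Hive SerDe) be issued from impala-shell and answered by Impala's long-running daemons instead of a freshly launched MapReduce job.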
    • Big Data Tradeoffs
      ◦ Time tradeoff: loading/building/indexing vs. runtime
      ◦ ACID properties: different distribution models may compromise one or more of these properties
      ◦ Be aware of what tradeoffs you're making
      ◦ TANSTAAFL ("there ain't no such thing as a free lunch"): massive scalability on commodity hardware, but at what price?
      ◦ Tool sophistication
    • NoSQL: “Not Only SQL”
      ◦ Sacrificing ACID properties for different scalability benefits
        - Key/Value Store: SimpleDB, Riak, Redis (a minimal sketch follows below)
        - Column Family Store: Cassandra, HBase
        - Document Database: CouchDB, MongoDB
        - Graph Database: Neo4J
      ◦ General properties
        - High horizontal scalability
        - Fast access
        - Simple data structures
        - Caching
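      To show how little ceremony the key/value end of this spectrum involves, a minimal sketch using Redis from Python. It assumes a Redis server on localhost and the redis-py client; the keys are made-up examples, not from the talk.

      import redis  # redis-py client; assumes a Redis server on localhost

      r = redis.Redis(host="localhost", port=6379)

      # A key/value store trades rich querying for simple lookups by key.
      r.set("hotel:8186:rate", "189.00")
      print(r.get("hotel:8186:rate"))  # b'189.00'

      # Atomic counters are a one-liner, and shard naturally across nodes.
      r.incr("submissions:2013-08-10")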
    • Getting Started
      ◦ Play in the sandbox: Hadoop/Hive/Pig local mode or AWS
      ◦ Randy Zwitch has a great tutorial on this:
        - http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/
      ◦ Using airline data:
        - http://stat-computing.org/dataexpo/2009/the-data.html
      ◦ Kaggle competitions (data science)
      ◦ Lots of big data sets available; look for machine learning repositories
    • Getting Started
      ◦ Books for Developers
      ◦ Books for Managers
    • MOOCs
      ◦ Unprecedented access to very high-quality online courses, including:
        - Udacity: Data Science Track
          - Intro to Data Science
          - Data Wrangling with MongoDB
          - Intro to Hadoop and MapReduce
        - Coursera:
          - Machine Learning course
          - Data Science Certificate Track (R, Python)
        - Waikato University: Weka
    • Bonus Round : Data Science
    • Outro
      ◦ We live in exciting times!
      ◦ Confluence of data, processing power, and algorithmic sophistication.
      ◦ More data is available to make better decisions more easily than at any other time in human history.