SlideShare a Scribd company logo
1 of 25
Dipping Your Toes into the
Big Data Pool
Orlando CodeCamp 2014
John Ternent
VP Application Development
TravelClick
About Me
 20+ years as a consultant, software engineer, architect,
and tech executive.
 Mostly data-focused, RDBMS, object database, and big
data/NoSQL/analytics/data science.
 Presently leading development efforts for TravelClick
Channel Management team.
 Twitter : @jaternent
Poll : Big Data
 How many people are comfortable with the definition?
 How many people are “doing” Big Data?
Big Data in the Media
 The Three Four V’s of Big Data:
 Volume (Scale)
 Variety (Forms)
 Velocity (Streaming)
 Veracity (Uncertainty)
 http://www.ibmbigdatahub.com/infographic/four-vs-
big-data
A New Definition
 Big Data is about a tool set and approach that allows for
non-linear scalability of solutions to data problems.
 “It depends on how capital your B and D are in Big
Data…”
 What is Big Data to you?
The Big Data Ecosystem
Data
Sources
Data
Storage
Data
Manipulation
Data
Management
Data
Analysis
• Sqoop
• Flume
• HDFS
• HBase
• Pig
• MapReduce
• Zookeeper
• Avro
• Oozie
• Hive
• Mahout
• Impala
The Full Hadoop Ecosystem?
Great, but What IS Hadoop?
 Implementation of Google MapReduce framework
 Distributed processing on commodity hardware
 Distributed file system with high failure tolerance
 Can support activity directly on top of distributed file
system (MapReduce jobs, Impala, Hive queries, etc)
Candidate Architecture
Data Sources
• Log files
• SQL DBs
• Text feeds
• Search
• Structured
• Unstructured
• Semi-
structured
HDFS
HDFS
HDFS
Data
Manipulation
• MapReduce
• Pig
• Hive
• Impala
Analytic
Products
• Search
• R/SAS
• Mahout
• SQL
Server
• DW/DMa
rt
Example : Log File Processing
xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] "POST /update-channels HTTP/1.1" 500 378 "-" "Zend_Http_Client" 53051 65921 617
- - - - [15/Jul/2013:04:03:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 544 94 590
xxx.16.23.133 - - [15/Jul/2013:04:04:00 -0400] "POST /update-channels HTTP/1.1" 200 104 "-" "Zend_Http_Client" 617786 4587 360
- - - [15/Jul/2013:04:04:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 568 94 590
- - - [15/Jul/2013:04:05:02 -0400] "GET /server-status?auto HTTP/1.1" 200 412 "-" "collectd/5.1.0" 560 94 591
xxx.16.23.70 - - [15/Jul/2013:04:05:09 -0400] "POST /fetch-channels HTTP/1.1" 200 3718 "-" "-" 452811 536 3975
xxx.16.23.70 - - [15/Jul/2013:04:05:10 -0400] "POST /fetch-channels HTTP/1.1" 200 6598 "-" "-" 333213 536 6855
xxx.16.23.70 - - [15/Jul/2013:04:05:11 -0400] "POST /fetch-channels HTTP/1.1" 200 5533 "-" "-" 282445 536 5790
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 8266 "-" "-" 462575 536 8542
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 42640 "-" "-" 1773203 536 42916
Example : Log File Processing
A = LOAD '/Users/jternent/Documents/logs/api*' USING TextLoader as (line:chararray);
B = FOREACH A GENERATE FLATTEN(
(tuple(chararray, chararray, chararray, chararray, chararray, int, int, chararray, chararray, int,
int, int))
REGEX_EXTRACT_ALL(line,'^(S+) (S+) (S+) [([w:/]+s[+-]d{4})] "(.+?)" (S+) (S+)
"([^"]*)" "([^"]*)" (d+) (d+) (d+)'))
as (forwarded_ip:chararray, rem_log:chararray,rem_user:chararray, ts:chararray,
req_url:chararray, result:int, resp_size:int, referrer:chararray, user_agent:chararray,
svc_time:int, rec_bytes:int, resp_bytes:int);
B1 = FILTER B BY ts IS NOT NULL;
B2 = FILTER B BY req_url MATCHES '.*[fetch|update].*';
B3 = FOREACH B2 GENERATE *, REGEX_EXTRACT(req_url, '^w+ /(S+)[?]* S+',1) as req;
C = FOREACH B3 GENERATE forwarded_ip, GetMonth(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as
month, GetDay(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as day,
GetHour(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as hour, req, result, svc_time;
D = GROUP C BY (month, day, hour, req, result);
E = FOREACH D GENERATE flatten(group), MAX(C.svc_time) as max, MIN(C.svc_time) as min,
COUNT(C) as count;
STORE E INTO '/Users/jternent/Documents/logs/ezy-logs-output' USING PigStorage
Another Real-World Example
 2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,"eventTime":"Aug
10, 2013 4:03:50
AM","hotelId":8186,"channelId":9173,"submissionId":1376121011,"sessionId
":null,"documentId":"9173SS8186_13761210111434582378cds.txt","queueN
ame":"expedia-
dx","roomCount":1,"submissionDayCount":1,"serverName":"orldc-auto-
11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":2,"su
bmissionStatusCode":0}
 2013-08-10T04:03:53-04:00 INFO (6): {"eventType":2,"eventTime":"Aug
10, 2013 4:03:53
AM","hotelId":8525,"channelId":50091,"submissionId":1376116653,"sessionI
d":null,"documentId":"50091SS8525_13761166531434520293cds.txt","queu
eName":"expedia-
dx","roomCount":5,"submissionDayCount":2,"serverName":"orldc-auto-
11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":1,"su
bmissionStatusCode":null}
 100 million (ish) / week of these. 25MB zipped per server per day (15
servers right now), 750MB uncompressed.
Pig Example - Pros and Cons
 Pros:
 Don’t need to ETL into a database, all off file system
 Same development for one file as 10,000 files
 Horizontally scalable
 UDFs allow fine-grained control
 Flexible
 Cons:
 Language can be difficult to work with
 MapReduce touches ALL the things to get the answer
(compare to indexed search)
Unstructured and Semi-
Structured Data
 Big Data tools can help with the analysis of data that
would be more challenging in a relational database
 Twitter feeds (Natural Language Processing)
 Social network analysis
 Big Data approaches to search are making search tools
more accessible and useful than ever
 ElasticSearch
ElasticSearch/Kibana
JSON
Documents
REST ElasticSearch
Logs
Hadoop
FileSystem
Kibana
Analytics with Big Data
 Apache Mahout
 Machine learning on Hadoop
 Recommendation
 Classification
 Clustering
 RHadoop
R mapreduce implementation on HDFS
 Tableau
 Visualization on HDFS/Hive
Main point : You don’t have to roll your own for everything, many tools now
using HDFS natively
Return to SQL
 Many SQL dialects are being/have been ported to
Hadoop
 Hive : Create DDL Tables on top of HDFS structures
CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^]*) ([^]*) ([^]*) (-|[^]*]) ([^
"]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^
"]*|".*"))?"
)
STORED AS TEXTFILE;
SELECT host, COUNT(*)
FROM apachelog
GROUP BY host;
Cloudera Impala
 Moves SQL processing onto each distributed node
 Written for performance
 Distribution and reduction of the query handled by the Impala
engine
Big Data Tradeoffs
 Time tradeoff – loading/building/indexing vs. runtime
 ACID properties – different distribution models may
compromise one or more of these properties
 Be aware of what tradeoffs you’re making
 TANSTAAFL – massive scalability, commodity hardware,
but at what price?
 Tool sophistication
NoSQL – “Not Only SQL”
 Sacrificing ACID properties for different scalability
benefits.
 Key/Value Store : SimpleDB, Riak, Redis
 Column Family Store : Cassandra, HBase
 Document Database : CouchDB, MongoDB
 Graph Database : Neo4J
 General properties
 High horizontal scalability
 Fast access
 Simple data structures
 Caching
Getting Started
 Play in the sandbox – Hadoop/Hive/Pig local mode or
AWS
 Randy Zwitch has a great tutorial on this :
 http://randyzwitch.com/big-data-hadoop-amazon-ec2-
cloudera-part-1/
 Using Airline data :
 http://stat-computing.org/dataexpo/2009/the-data.html
 Kaggle competitions (data science)
 Lots of big data sets available, look for machine
learning repositories
Getting Started
 Books for Developers
 Books for Managers
MOOCs
 Unprecedented access to very high-quality online
courses, including
 Udacity : Data Science Track
 Intro to Data Science
 Data Wrangling with MongoDB
 Intro to Hadoop and MapReduce
 Coursera :
 Machine Learning course
 Data Science Certificate Track (R, Python)
 Waikato University : Weka
Bonus Round : Data Science
Outro
 We live in exciting times!
 Confluence of data, processing power, and algorithmic
sophistication.
 More data is available to make better decisions more
easily than any other time in human history.

More Related Content

What's hot

Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgDavid Pilato
 
Big Data Expo 2015 - Gigaspaces Making Sense of it all
Big Data Expo 2015 - Gigaspaces Making Sense of it allBig Data Expo 2015 - Gigaspaces Making Sense of it all
Big Data Expo 2015 - Gigaspaces Making Sense of it allBigDataExpo
 
201809 DB tech showcase
201809 DB tech showcase201809 DB tech showcase
201809 DB tech showcaseKeisuke Suzuki
 
ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcasesAndrii Gakhov
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeDenny Lee
 
Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB MongoDB
 
Building a Real-Time Gaming Analytics Service with Apache Druid
Building a Real-Time Gaming Analytics Service with Apache DruidBuilding a Real-Time Gaming Analytics Service with Apache Druid
Building a Real-Time Gaming Analytics Service with Apache DruidImply
 
Apache Druid Vision and Roadmap
Apache Druid Vision and RoadmapApache Druid Vision and Roadmap
Apache Druid Vision and RoadmapImply
 
Standard Provenance Reporting and Scientific Software Management in Virtual L...
Standard Provenance Reporting and Scientific Software Management in Virtual L...Standard Provenance Reporting and Scientific Software Management in Virtual L...
Standard Provenance Reporting and Scientific Software Management in Virtual L...njcar
 
Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoopRommel Garcia
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeDatabricks
 
How TrafficGuard uses Druid to Fight Ad Fraud and Bots
How TrafficGuard uses Druid to Fight Ad Fraud and BotsHow TrafficGuard uses Druid to Fight Ad Fraud and Bots
How TrafficGuard uses Druid to Fight Ad Fraud and BotsImply
 
Apache Druid®: A Dance of Distributed Processes
 Apache Druid®: A Dance of Distributed Processes Apache Druid®: A Dance of Distributed Processes
Apache Druid®: A Dance of Distributed ProcessesImply
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventTrivadis
 
USQ Landdemos Azure Data Lake
USQ Landdemos Azure Data LakeUSQ Landdemos Azure Data Lake
USQ Landdemos Azure Data LakeTrivadis
 
Trivadis Azure Data Lake
Trivadis Azure Data LakeTrivadis Azure Data Lake
Trivadis Azure Data LakeTrivadis
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData StoryLynn Langit
 

What's hot (20)

Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed Luxembourg
 
Big Data Expo 2015 - Gigaspaces Making Sense of it all
Big Data Expo 2015 - Gigaspaces Making Sense of it allBig Data Expo 2015 - Gigaspaces Making Sense of it all
Big Data Expo 2015 - Gigaspaces Making Sense of it all
 
201809 DB tech showcase
201809 DB tech showcase201809 DB tech showcase
201809 DB tech showcase
 
ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcases
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On Time
 
Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB
 
Time series databases
Time series databasesTime series databases
Time series databases
 
Building a Real-Time Gaming Analytics Service with Apache Druid
Building a Real-Time Gaming Analytics Service with Apache DruidBuilding a Real-Time Gaming Analytics Service with Apache Druid
Building a Real-Time Gaming Analytics Service with Apache Druid
 
Interactive query using hadoop
Interactive query using hadoopInteractive query using hadoop
Interactive query using hadoop
 
Apache Druid Vision and Roadmap
Apache Druid Vision and RoadmapApache Druid Vision and Roadmap
Apache Druid Vision and Roadmap
 
Standard Provenance Reporting and Scientific Software Management in Virtual L...
Standard Provenance Reporting and Scientific Software Management in Virtual L...Standard Provenance Reporting and Scientific Software Management in Virtual L...
Standard Provenance Reporting and Scientific Software Management in Virtual L...
 
Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoop
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
 
How TrafficGuard uses Druid to Fight Ad Fraud and Bots
How TrafficGuard uses Druid to Fight Ad Fraud and BotsHow TrafficGuard uses Druid to Fight Ad Fraud and Bots
How TrafficGuard uses Druid to Fight Ad Fraud and Bots
 
Apache Druid®: A Dance of Distributed Processes
 Apache Druid®: A Dance of Distributed Processes Apache Druid®: A Dance of Distributed Processes
Apache Druid®: A Dance of Distributed Processes
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
 
Spark and MongoDB
Spark and MongoDBSpark and MongoDB
Spark and MongoDB
 
USQ Landdemos Azure Data Lake
USQ Landdemos Azure Data LakeUSQ Landdemos Azure Data Lake
USQ Landdemos Azure Data Lake
 
Trivadis Azure Data Lake
Trivadis Azure Data LakeTrivadis Azure Data Lake
Trivadis Azure Data Lake
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData Story
 

Similar to Intro to Big Data - Orlando Code Camp 2014

Redshift Introduction
Redshift IntroductionRedshift Introduction
Redshift IntroductionDataKitchen
 
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...Certus Solutions
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in MotionRuhani Arora
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics PlatformSrinath Perera
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayAjay Shriwastava
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBDenny Lee
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...Timothy Spann
 
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Big Data or Data Warehousing? How to Leverage Both in the Enterprise
Big Data or Data Warehousing? How to Leverage Both in the EnterpriseBig Data or Data Warehousing? How to Leverage Both in the Enterprise
Big Data or Data Warehousing? How to Leverage Both in the EnterpriseDean Hallman
 
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data ArchivalGPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data ArchivalScyllaDB
 
Unleash the Power of Azure Data Factory - SQL User Group
Unleash the Power of Azure Data Factory - SQL User GroupUnleash the Power of Azure Data Factory - SQL User Group
Unleash the Power of Azure Data Factory - SQL User GroupSergio Zenatti Filho
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit/Hadoop Summit
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageSATOSHI TAGOMORI
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerDataWorks Summit
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...Marcin Bielak
 
Big Data on Azure Tutorial
Big Data on Azure TutorialBig Data on Azure Tutorial
Big Data on Azure Tutorialrustd
 
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Lviv Startup Club
 

Similar to Intro to Big Data - Orlando Code Camp 2014 (20)

Redshift Introduction
Redshift IntroductionRedshift Introduction
Redshift Introduction
 
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics Platform
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjay
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
 
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics PlatformWSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
WSO2Con EU 2016: An Introduction to the WSO2 Analytics Platform
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Big Data or Data Warehousing? How to Leverage Both in the Enterprise
Big Data or Data Warehousing? How to Leverage Both in the EnterpriseBig Data or Data Warehousing? How to Leverage Both in the Enterprise
Big Data or Data Warehousing? How to Leverage Both in the Enterprise
 
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data ArchivalGPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival
 
Unleash the Power of Azure Data Factory - SQL User Group
Unleash the Power of Azure Data Factory - SQL User GroupUnleash the Power of Azure Data Factory - SQL User Group
Unleash the Power of Azure Data Factory - SQL User Group
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
Big Data on Azure Tutorial
Big Data on Azure TutorialBig Data on Azure Tutorial
Big Data on Azure Tutorial
 
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
 

Recently uploaded

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Recently uploaded (20)

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Intro to Big Data - Orlando Code Camp 2014

  • 1. Dipping Your Toes into the Big Data Pool Orlando CodeCamp 2014 John Ternent VP Application Development TravelClick
  • 2. About Me  20+ years as a consultant, software engineer, architect, and tech executive.  Mostly data-focused, RDBMS, object database, and big data/NoSQL/analytics/data science.  Presently leading development efforts for TravelClick Channel Management team.  Twitter : @jaternent
  • 3. Poll : Big Data  How many people are comfortable with the definition?  How many people are “doing” Big Data?
  • 4. Big Data in the Media  The Three Four V’s of Big Data:  Volume (Scale)  Variety (Forms)  Velocity (Streaming)  Veracity (Uncertainty)  http://www.ibmbigdatahub.com/infographic/four-vs- big-data
  • 5. A New Definition  Big Data is about a tool set and approach that allows for non-linear scalability of solutions to data problems.  “It depends on how capital your B and D are in Big Data…”  What is Big Data to you?
  • 6. The Big Data Ecosystem Data Sources Data Storage Data Manipulation Data Management Data Analysis • Sqoop • Flume • HDFS • HBase • Pig • MapReduce • Zookeeper • Avro • Oozie • Hive • Mahout • Impala
  • 7. The Full Hadoop Ecosystem?
  • 8. Great, but What IS Hadoop?  Implementation of Google MapReduce framework  Distributed processing on commodity hardware  Distributed file system with high failure tolerance  Can support activity directly on top of distributed file system (MapReduce jobs, Impala, Hive queries, etc)
  • 9. Candidate Architecture Data Sources • Log files • SQL DBs • Text feeds • Search • Structured • Unstructured • Semi- structured HDFS HDFS HDFS Data Manipulation • MapReduce • Pig • Hive • Impala Analytic Products • Search • R/SAS • Mahout • SQL Server • DW/DMa rt
  • 10. Example : Log File Processing xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] "POST /update-channels HTTP/1.1" 500 378 "-" "Zend_Http_Client" 53051 65921 617 - - - - [15/Jul/2013:04:03:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 544 94 590 xxx.16.23.133 - - [15/Jul/2013:04:04:00 -0400] "POST /update-channels HTTP/1.1" 200 104 "-" "Zend_Http_Client" 617786 4587 360 - - - [15/Jul/2013:04:04:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 568 94 590 - - - [15/Jul/2013:04:05:02 -0400] "GET /server-status?auto HTTP/1.1" 200 412 "-" "collectd/5.1.0" 560 94 591 xxx.16.23.70 - - [15/Jul/2013:04:05:09 -0400] "POST /fetch-channels HTTP/1.1" 200 3718 "-" "-" 452811 536 3975 xxx.16.23.70 - - [15/Jul/2013:04:05:10 -0400] "POST /fetch-channels HTTP/1.1" 200 6598 "-" "-" 333213 536 6855 xxx.16.23.70 - - [15/Jul/2013:04:05:11 -0400] "POST /fetch-channels HTTP/1.1" 200 5533 "-" "-" 282445 536 5790 xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 8266 "-" "-" 462575 536 8542 xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 42640 "-" "-" 1773203 536 42916
  • 11. Example : Log File Processing A = LOAD '/Users/jternent/Documents/logs/api*' USING TextLoader as (line:chararray); B = FOREACH A GENERATE FLATTEN( (tuple(chararray, chararray, chararray, chararray, chararray, int, int, chararray, chararray, int, int, int)) REGEX_EXTRACT_ALL(line,'^(S+) (S+) (S+) [([w:/]+s[+-]d{4})] "(.+?)" (S+) (S+) "([^"]*)" "([^"]*)" (d+) (d+) (d+)')) as (forwarded_ip:chararray, rem_log:chararray,rem_user:chararray, ts:chararray, req_url:chararray, result:int, resp_size:int, referrer:chararray, user_agent:chararray, svc_time:int, rec_bytes:int, resp_bytes:int); B1 = FILTER B BY ts IS NOT NULL; B2 = FILTER B BY req_url MATCHES '.*[fetch|update].*'; B3 = FOREACH B2 GENERATE *, REGEX_EXTRACT(req_url, '^w+ /(S+)[?]* S+',1) as req; C = FOREACH B3 GENERATE forwarded_ip, GetMonth(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as month, GetDay(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as day, GetHour(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as hour, req, result, svc_time; D = GROUP C BY (month, day, hour, req, result); E = FOREACH D GENERATE flatten(group), MAX(C.svc_time) as max, MIN(C.svc_time) as min, COUNT(C) as count; STORE E INTO '/Users/jternent/Documents/logs/ezy-logs-output' USING PigStorage
  • 12. Another Real-World Example  2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,"eventTime":"Aug 10, 2013 4:03:50 AM","hotelId":8186,"channelId":9173,"submissionId":1376121011,"sessionId ":null,"documentId":"9173SS8186_13761210111434582378cds.txt","queueN ame":"expedia- dx","roomCount":1,"submissionDayCount":1,"serverName":"orldc-auto- 11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":2,"su bmissionStatusCode":0}  2013-08-10T04:03:53-04:00 INFO (6): {"eventType":2,"eventTime":"Aug 10, 2013 4:03:53 AM","hotelId":8525,"channelId":50091,"submissionId":1376116653,"sessionI d":null,"documentId":"50091SS8525_13761166531434520293cds.txt","queu eName":"expedia- dx","roomCount":5,"submissionDayCount":2,"serverName":"orldc-auto- 11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":1,"su bmissionStatusCode":null}  100 million (ish) / week of these. 25MB zipped per server per day (15 servers right now), 750MB uncompressed.
  • 13. Pig Example - Pros and Cons  Pros:  Don’t need to ETL into a database, all off file system  Same development for one file as 10,000 files  Horizontally scalable  UDFs allow fine-grained control  Flexible  Cons:  Language can be difficult to work with  MapReduce touches ALL the things to get the answer (compare to indexed search)
  • 14. Unstructured and Semi- Structured Data  Big Data tools can help with the analysis of data that would be more challenging in a relational database  Twitter feeds (Natural Language Processing)  Social network analysis  Big Data approaches to search are making search tools more accessible and useful than ever  ElasticSearch
  • 16. Analytics with Big Data  Apache Mahout  Machine learning on Hadoop  Recommendation  Classification  Clustering  RHadoop R mapreduce implementation on HDFS  Tableau  Visualization on HDFS/Hive Main point : You don’t have to roll your own for everything, many tools now using HDFS natively
  • 17. Return to SQL  Many SQL dialects are being/have been ported to Hadoop  Hive : Create DDL Tables on top of HDFS structures CREATE TABLE apachelog ( host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^]*) ([^]*) ([^]*) (-|[^]*]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?" ) STORED AS TEXTFILE; SELECT host, COUNT(*) FROM apachelog GROUP BY host;
  • 18. Cloudera Impala  Moves SQL processing onto each distributed node  Written for performance  Distribution and reduction of the query handled by the Impala engine
  • 19. Big Data Tradeoffs  Time tradeoff – loading/building/indexing vs. runtime  ACID properties – different distribution models may compromise one or more of these properties  Be aware of what tradeoffs you’re making  TANSTAAFL – massive scalability, commodity hardware, but at what price?  Tool sophistication
  • 20. NoSQL – “Not Only SQL”  Sacrificing ACID properties for different scalability benefits.  Key/Value Store : SimpleDB, Riak, Redis  Column Family Store : Cassandra, HBase  Document Database : CouchDB, MongoDB  Graph Database : Neo4J  General properties  High horizontal scalability  Fast access  Simple data structures  Caching
  • 21. Getting Started  Play in the sandbox – Hadoop/Hive/Pig local mode or AWS  Randy Zwitch has a great tutorial on this :  http://randyzwitch.com/big-data-hadoop-amazon-ec2- cloudera-part-1/  Using Airline data :  http://stat-computing.org/dataexpo/2009/the-data.html  Kaggle competitions (data science)  Lots of big data sets available, look for machine learning repositories
  • 22. Getting Started  Books for Developers  Books for Managers
  • 23. MOOCs  Unprecedented access to very high-quality online courses, including  Udacity : Data Science Track  Intro to Data Science  Data Wrangling with MongoDB  Intro to Hadoop and MapReduce  Coursera :  Machine Learning course  Data Science Certificate Track (R, Python)  Waikato University : Weka
  • 24. Bonus Round : Data Science
  • 25. Outro  We live in exciting times!  Confluence of data, processing power, and algorithmic sophistication.  More data is available to make better decisions more easily than any other time in human history.