Big Data: Technologies &
Challenges Facing Business
Today
A Practical Guide to Getting Started
Hello, My Name is Sara Robertson
About Me
• I’m the VP of Technology at
CPX Interactive.
• Previously ran the platform
team at Warner Music Group.
• I choose technology based on
cost, efficiency, availability of
talent, valuation potential, long
term outlook, community
support, and other reasons
besides just pure tech.
• I love agile, open source, bio-hacking,
anime and martial arts movies,
emo, audiobooks.
• Favorite tech skills:
optimization, debugging.
About Me + Data
• Started at a mainframe
company doing Nagios-style
server & network monitoring.
• Spent my early years
obsessed with Oracle, then
Postgres, then MySQL.
• Most of my data experience is
in either high-traffic web
applications or back-office data
warehousing.
• I believe that data is about
humans as much as it is about
technology, and if your solution
doesn’t speak to your users
then it’s not really a solution.
8/14/2014 3
Big Data Landscape
Big Data We’ll Talk About Today
Four Problems With Data Problems
Too Big Too Fast Too Disparate Too Unwieldy
x_cslookup
system, creative_id, cpx_creative_name, viq_creative_name, alt_creative_id, alt_creative_name, size, file_size, click_track, audit_status, type, li_number, li_name, camp_number, camp_name, viq_placement_id

hourlycpxDataSiphonHour.py
x_datasiphon
auction_id_64               bigint
date_time                   string
user_tz_offset              tinyint
width                       smallint
height                      smallint
media_type                  tinyint
fold_position               tinyint
event_type                  string
imp_type                    tinyint
payment_type                tinyint
media_cost_dollars_cpm      float
revenue_type                tinyint
buyer_spend                 float
buyer_bid                   float
ecp                         float
eap                         float
is_imp                      int
is_learn                    tinyint
predict_type_rev            tinyint
othuser_id_64               bigint
ip_address                  string
ip_address_trunc            string
geo_country                 string
geo_region                  string
operating_system            tinyint
browser                     tinyint
language                    tinyint
venue_id                    int
seller_member_id            int
publisher_id                int
site_id                     int
site_domain                 string
tag_id                      int
external_inv_id             int
reserve_price               float
seller_revenue_cpm          float
media_buy_rev_share_pct     float
pub_rule_id                 int
seller_currency             string
publisher_currency          string
publisher_exchange_rate     float
serving_fees_cpm            float
serving_fees_revshare       float
buyer_member_id             int
advertiser_id               int
brand_id                    int
advertiser_frequency        int
advertiser_recency          int
insertion_order_id          int
campaign_group_id           int
campaign_id                 int
creative_id                 int
creative_freq               int
creative_rec                int
cadence_modifier            float
can_convert                 tinyint
user_group_id               int
is_control                  tinyint
control_pct                 tinyint
control_creative_id         int
is_click                    int
pixel_id                    int
is_remarketing              tinyint
post_click_conv             int
post_view_conv              int
post_click_revenue          float
post_view_revenue           float
order_id                    string
external_data               string
pricing_type                string
booked_revenue_dollars      float
booked_revenue_adv_curr     float
commission_cpm              float
commission_revshare         float
auction_service_deduction   float
auction_service_fees        float
creative_overage_fees       float
clear_fees                  float
buyer_currency              string
advertiser_currency         string
advertiser_exchange_rate    float
Google’s qpms in 2000
Google’s qpms in 2011
Old Table Layouts
New Table Layouts
Old File Formats
New File Formats
Old Pattern of Change in Data & Business Processes (Waterfall!)
New Pattern of Change, Agile!
Data Problems in Advertising Examples
Our statistics at CPX…
• 5.5+ billion impressions per day
• 20+ billion bids per day
• 45+ billion segments per day
• 80+ columns per data stream
• Columns average 25+ bytes
That’s more than 100
Terabytes and 75 Billion
records every day! *
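The headline figure can be checked with back-of-the-envelope arithmetic. The sketch below uses only the lower bounds quoted in the bullets, and it assumes (this is not stated on the slide) that every record in all three streams carries 80 columns of 25 bytes each and that a terabyte is 10^12 bytes. The lower bounds sum to 70.5 billion records, a little under the 75 billion quoted, which is consistent with each figure carrying a "+".

```python
# Daily record counts quoted on the slide (lower bounds, since each is "N+").
impressions_per_day = 5.5e9
bids_per_day = 20e9
segments_per_day = 45e9

# Assumption: every record in every stream has the same shape.
columns_per_record = 80   # "80+ columns per data stream"
bytes_per_column = 25     # "Columns average 25+ bytes"

records_per_day = impressions_per_day + bids_per_day + segments_per_day
bytes_per_day = records_per_day * columns_per_record * bytes_per_column
terabytes_per_day = bytes_per_day / 1e12  # decimal terabytes (assumption)

print(f"{records_per_day / 1e9:.1f} billion records per day")
print(f"{terabytes_per_day:.0f} TB per day")
```

Under these assumptions the estimate lands at roughly 141 TB per day, comfortably above the "more than 100 Terabytes" on the slide.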
Website
Ad
Server
• Access Logs
• Cookies
• Interactions
…
• Predictions
• Demographics
• Prices…
• Potential Buyers
• Market Trends
• Preferences…
Exchanges
Bidders
• Bid Parameters
• Wins/Losses
• Ceilings/Floors
…
• Creatives
• Targeting
• Analytics…
• Bid Attempts
• Imp Value
• Market Demand
• Revenue
• Costs
• Performance…
• Profit
• Winners
• Demographics
Life of a Single Ad Impression
Data Problems in Music Examples
Our statistics at Warner…
• 5+ million radio plays daily
• 10+ million digital tracks sold
daily
• $1+ million in ecommerce daily
• 20+ million fans online
• 10-20 channels of interaction
with every fan
• Thousands of feeds of data that
could potentially mention a band
An essentially unlimited
supply of new data
streams with ever-
changing data formats!
1:07:01pm Radio Plays in Seattle:
MBUBLE HAVENT MET YOU YE
BUBLÉ, MICHAEL
HAVEN’TMETYOUYET
1:07:02pm On Twitter in Cyberspace:
OMG this song is so sick! <3 #mbuble #haventmet
This met you yet Bubblé song makes me sick.
1:07:03pm On Website from China:
12 visitors to the website.
1:07:00pm On TV in New York:
Michael Bublé appears on Oprah.
• First match up the many different
versions of the artist’s name
• Then Analyze sentiment to tell the
difference between uses of “sick”
• Then Augment sparse data streams
with useful dimensions (time,
location)
• Then decide how to correlate data!
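The first step, matching up the different versions of an artist's name, can be sketched as below. This is a minimal illustration and not the approach used at Warner: the function name `normalize_artist` and the `KNOWN_ALIASES` table are hypothetical, and a real feed would need far more rules than stripping accents and reordering "LAST, FIRST".

```python
import re
import unicodedata

# Hypothetical alias table: abbreviations that no rule can expand on its own.
KNOWN_ALIASES = {"mbuble": "michael buble"}

def normalize_artist(raw: str) -> str:
    """Reduce one spelling of an artist name to a lowercase ASCII key."""
    # NFKD splits accented letters into base letter + combining mark (and a
    # stray "´" into space + mark), so dropping non-ASCII leaves plain letters.
    text = unicodedata.normalize("NFKD", raw)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Turn "LAST, FIRST" into "FIRST LAST".
    if "," in text:
        last, first = text.split(",", 1)
        text = f"{first} {last}"
    key = " ".join(re.findall(r"[a-z]+", text.lower()))
    return KNOWN_ALIASES.get(key, key)

variants = ["BUBLE´, MICHAEL", "Michael Buble´", "Michael Bublé", "#mbuble"]
keys = {normalize_artist(v) for v in variants}
print(keys)
```

All four spellings collapse to the same key, which is what the later sentiment and correlation steps need before they can treat the streams as being about one artist.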
Database Types
Unstructured to Structured:
• Document Store / Distributed File System: Used for unstructured fast-flowing data
• Key-Value Pair: Used for semi-structured fast-flowing data
• Massively Parallel (MPPs): Used for structured high-volume, high read + write data
• Master/Slave: Used for heavy read apps with normal data volume & scale requirements
• In-Memory Columnar: Used for super high speed read-only access to cacheable data
Document stores export into Parallel and Master databases, which cache into Columnar databases.
What we tried in Music Case Study
Roll-Your-Own approach
• Python + RabbitMQ +
MongoDB + PHP for
custom BI layer
• Custom development of
workflow, transformation,
storage, correlation,
smoothing, analysis
• Custom dev of dashboards,
reports, charts, etc for the
business
Why it didn’t work
• Bleeding edge technologies
were too immature and cost
of talent was too high
• Outsourced dev + insourced
support = fail
• Too much overhead to get a
usable product
What we’re doing in Advertising Case Study
Use-A-Stack approach
• Leverage a kick-start with
a stack that reduces
implementation time,
learning curve, and talent
costs
• Write pluggable modules
• Build the plan for multi-
layered data storage from
the beginning
Why it’s working
• By keeping our
investment and footprint
light, we’re able to
respond quickly to
changes in the industry &
technology ecosystem
• The multiple layers of
data are the key to
building products at scale
Data Warehouse, hosting, web platform, products, Development, 3rd party integrations, R&D
Custom Modules, Custom Themes, Contrib Modules, Contrib Themes, Contrib Core, Custom Cores, Contrib Libraries, Glue Code, custom SQL, Contrib Services, Custom Code, custom Scripts, Paid Services, Contrib Tools
• Drupal, Wordpress, PHP, Javascript, jQuery, HTML, CSS, Flash, Bash, Perl, etc...
• MySQL, PostgreSQL, Hadoop, Cloudera, Hive, Hue, Impala, Python, Java, SQL, etc...
• Amazon, Ubuntu, Apache2, Nginx, Node.js, Memcache, Highwinds CDN, etc…
• Appnexus, Right Media, Google, Salesforce, Zendesk, Microsoft, Chrome, Dropbox, etc...
• Git, VSphere, VMware, Drush Make, MAMP, Confluence, Agile / Scrum, SOASTA, etc...
• Mobile, Video, Bidders, IPs, Viewability, Emerging Tech...
Data Warehouse, hosting, web platform, products, Development, 3rd party integrations, R&D
Custom Modules, Custom Themes, Contrib Modules, Contrib Themes, Contrib Core, Custom Cores, Contrib Libraries, Glue Code, custom SQL, Contrib Services, Custom Code, custom Scripts, Paid Services, Contrib Tools
• Drupal, Wordpress, PHP, Javascript, jQuery, HTML, CSS, Flash, Bash, etc...
• MySQL, PostgreSQL, Hadoop, Cloudera, Hive, Hue, Impala, Python, Java, SQL, etc...
• Amazon, Ubuntu, Apache2, Nginx, Node.js, Memcache, Highwinds CDN, etc…
• Appnexus, Right Media, Google, Salesforce, Zendesk, Microsoft, Chrome, Dropbox, etc...
• Git, VSphere, VMware, Drush, Make, MAMP, Confluence, Scrum, etc...
• Mobile, Video, Bidders, IPs, Viewability, Emerging Tech...
What does our stack look like?
We only build the red stuff!
Hadoop Distributed File System
HDFS: Everybody’s Doing
It
– It’s just a file system!
– Feed it gzips, csvs,
whatever you’ve got
– Command line + library
interface to read/write files
to it
– Can be slow due to
replication across network
to data nodes
– Not much different than
sed/awk
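The "command line interface" bullet can be made concrete with the standard Hadoop file system shell, which provides `hdfs dfs -put` and `hdfs dfs -ls`. The sketch below only builds the command lines; the helper names and the file paths are hypothetical, chosen for illustration.

```python
def hdfs_put_command(local_path: str, hdfs_path: str) -> list[str]:
    """Build the `hdfs dfs -put` command that copies a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]

def hdfs_ls_command(hdfs_path: str) -> list[str]:
    """Build the `hdfs dfs -ls` command that lists a directory in HDFS."""
    return ["hdfs", "dfs", "-ls", hdfs_path]

# Hypothetical paths, for illustration only.
put = hdfs_put_command("impressions-2014-08-14.csv.gz", "/data/raw/impressions/")
ls = hdfs_ls_command("/data/raw/impressions/")

print(" ".join(put))
print(" ".join(ls))
```

On a machine with a configured Hadoop client, each list can be passed to `subprocess.run(command, check=True)`. The gzipped CSV goes in as-is, which is the sense in which HDFS is "just a file system".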
[Diagram: one Name Node and six Data Nodes]
Parallel DBs RDBMS
The Holy Grail of
Database Scalability
Claims of database parallelization in the
past have been greatly exaggerated.
Nevertheless, I believe we may be at the
dawn of a new era in this space.
Paths to parallelization
Sharding: Manual split of tables into
independent db instances. Joins across
dbs not possible without manual extract
and re-load into one instance.
Federation: Automatic split of tables
into independent db instances. Joins
across dbs managed by high-level
software layer that extracts data and
joins/merges outside the db instances.
Performance penalty in data
extraction/merge. Much redundant
work performed by each db instance in
parsing and compiling SQL.
True-MPP: Only one db instance, with
multiple compute/storage nodes. All
Joins across nodes are managed
natively by the execution engine within
the db instance. No redundant work
performed, no performance penalties.
Technique    Degree of Automation    Vendor        Price
Sharding     Manual                  Various…      Low
Federation   Semi-Manual             GreenPlum     High
True-MPP     Fully Auto              Netezza       High
True-MPP     Fully Auto              XtremeData    Low
* Thanks to Ravi, the CTO of XtremeData, for contributing this breakdown!
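The sharding path described on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's implementation: two in-memory SQLite databases stand in for independent db instances, and the table, columns, and hash-based routing rule are all hypothetical. The final loop is the manual extract-and-merge step that sharding leaves to you and that a federation layer automates.

```python
import sqlite3
import zlib

NUM_SHARDS = 2  # assumption: two independent database instances

# Each in-memory SQLite database stands in for a separate db instance.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE impressions (campaign_id INTEGER, revenue REAL)")

def shard_for(campaign_id: int) -> sqlite3.Connection:
    """Route a row to a shard by hashing its key (the manual split)."""
    return shards[zlib.crc32(str(campaign_id).encode()) % NUM_SHARDS]

rows = [(101, 1.5), (102, 2.0), (103, 0.5), (101, 3.0)]
for campaign_id, revenue in rows:
    shard_for(campaign_id).execute(
        "INSERT INTO impressions VALUES (?, ?)", (campaign_id, revenue)
    )

# No single instance can see every row, so a cross-shard total means
# extracting a partial result from each shard and merging outside the db.
totals: dict[int, float] = {}
query = "SELECT campaign_id, SUM(revenue) FROM impressions GROUP BY campaign_id"
for db in shards:
    for campaign_id, partial in db.execute(query):
        totals[campaign_id] = totals.get(campaign_id, 0.0) + partial

print(totals)
```

Note that each shard parses and runs the same SQL independently, which is the redundant work the slide attributes to split-instance designs.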
Traditional DBs RDBMS
MySQL Is Your Best Friend
• Feeding it from a
warehouse is the hardest
part; needs workflow
software and reduce jobs
• Works great for read-
heavy web applications
• Cheap talent, cheap
hosting, tons of support
• Creativity required for
heavy writes, e.g. node.js
+ queuing mechanism
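The "queuing mechanism" idea for heavy writes can be sketched as follows: the web tier drops writes onto a queue and returns immediately, while a separate step drains the queue and writes in batches. The slide mentions node.js; this sketch uses Python for consistency with the other examples, with an in-memory SQLite database standing in for MySQL. The table, function names, and batch size are hypothetical.

```python
import queue
import sqlite3

BATCH_SIZE = 3  # assumption: flush after this many queued writes

write_queue = queue.Queue()
db = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
db.execute("CREATE TABLE clicks (page TEXT, count INTEGER)")

def enqueue_write(page: str, count: int) -> None:
    """Called by the web tier; returns immediately without touching the db."""
    write_queue.put((page, count))

def flush_batch() -> int:
    """Drain up to BATCH_SIZE queued rows and write them in one transaction."""
    batch = []
    while len(batch) < BATCH_SIZE:
        try:
            batch.append(write_queue.get_nowait())
        except queue.Empty:
            break
    if batch:
        with db:  # commits once for the whole batch
            db.executemany("INSERT INTO clicks VALUES (?, ?)", batch)
    return len(batch)

for i in range(5):
    enqueue_write(f"/page/{i}", i)

# Keep flushing until the queue is empty.
while flush_batch():
    pass

row_count = db.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
print(row_count)
```

Batching trades a little latency for far fewer transactions, which is usually the point of putting a queue in front of a read-optimized database.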
In-Memory Columnar DBs In-Memory
The New Memcached
• In-memory DBs (often lumped in
with “Columnar” databases) are, at
their simplest, just key-value pairs:
put(‘name’, ‘value’)
• Some sophisticated layers
have been built on top to
turn it into near-SQL
• Crazy fast solution for read-
heavy systems like analytics
• Still needs workflow,
management, and a
traditional backend storage
system
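The put(‘name’, ‘value’) interface, together with the point that a backend storage system is still needed, can be sketched as a read-through cache. This is a minimal illustration, not any product's API: the class name `ReadThroughCache` is hypothetical, and a plain dict stands in for the traditional backend database.

```python
from typing import Callable, Dict, Optional

class ReadThroughCache:
    """In-memory key-value layer in front of a slower backend store."""

    def __init__(self, load_from_backend: Callable[[str], Optional[str]]):
        self._data: Dict[str, str] = {}
        self._load = load_from_backend
        self.backend_reads = 0

    def put(self, name: str, value: str) -> None:
        self._data[name] = value

    def get(self, name: str) -> Optional[str]:
        if name not in self._data:
            # Cache miss: fall back to the traditional backend once.
            self.backend_reads += 1
            value = self._load(name)
            if value is None:
                return None
            self._data[name] = value
        return self._data[name]

# A plain dict stands in for the backend database (assumption).
backend = {"campaign:42": "Summer Sale"}
cache = ReadThroughCache(backend.get)

cache.put("name", "value")
first = cache.get("campaign:42")   # miss, loaded from the backend
second = cache.get("campaign:42")  # hit, served from memory
print(first, second, cache.backend_reads)
```

Only the first read touches the backend, which is why this layer suits read-heavy systems and why it cannot replace the backend itself.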
Big Data Distribution Requirements Hosting
Massive Cheap Infrastructure
• Crazy virtual server farms!
1000+ servers get created and
destroyed to perform 1 job
• Automation and deployment of
these servers is crucial,
infrastructure automation is the
new hot skill
• Small-to-medium systems or
growing products use cloud
first and only invest in metal
once stabilized, and even then
it’s rarely cost effective
• Connections between the
servers drive the performance
of your data warehouse
solution!
Life in the Cloud:
It’s so different. Forget
everything you thought you
knew. Except Unix.
Major Bottlenecks:
• Reading & writing to disk: disks are
usually network-connected to cloud-
based servers
• Communicating with other servers
during replication; need to shave off
milliseconds with optimizations
• Staying ahead of storage space
limitations with archive jobs
• Partitioning large datasets based on
primary reduce filters
• Keeping up with your dataset when you
start to get behind
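The partitioning bullet can be illustrated with a small sketch: if most jobs filter on time first, laying records out by date and hour lets a job read one partition instead of the whole dataset. The `key=value` directory naming follows the convention Hive uses for partitions; the record layout, the timestamp format, and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def partition_key(record: dict) -> str:
    """Build a Hive-style partition path from the record's timestamp."""
    # Assumes timestamps look like "YYYY-MM-DD HH:MM:SS".
    date, time = record["date_time"].split(" ")
    hour = time.split(":")[0]
    return f"dt={date}/hour={hour}"

records = [
    {"date_time": "2014-08-14 13:07:01", "event_type": "imp"},
    {"date_time": "2014-08-14 13:07:02", "event_type": "click"},
    {"date_time": "2014-08-14 14:00:00", "event_type": "imp"},
]

partitions = defaultdict(list)
for record in records:
    partitions[partition_key(record)].append(record)

# A job filtered to 13:00 now reads one partition, not the whole dataset.
hour_13 = partitions["dt=2014-08-14/hour=13"]
print(sorted(partitions), len(hour_13))
```

The same layout also helps with the archive bullet, since old partitions can be moved or deleted as whole directories.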
Coordination Strategies Implementation
Amazon Hosted Cloud
• It’s like a treasure hunt.*
Vsphere/Openstack Private Cloud
• Roll your own someday.
Cloudera Hadoop Management
• You will love life.
Chef Deployment Automation
• OMG life gets better.
* Note: Rackspace is also awesome.
Workflow Strategies Implementation
You Still Need Data In & Out
• Hive/Pig – You definitely need them
– SQL sits on top of Hadoop so you
can query flat files like a table!
– Output into an RDBMS is easy, but
managing the jobs is hard
– Nobody wants to learn MapReduce
• Custom Coding
– Long term supportability is low
– High cost & slow to market
• Cloudera has Workflow Services!
– Impala
– Flume
No one is really the standard in this space yet, although there are a lot
of really interesting players. Check out the Big Data chart for more!
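The idea of querying flat files like a table can be sketched without a cluster. In the example below, SQLite stands in for Hive and an in-memory string stands in for a file in HDFS; both are assumptions made so the example is self-contained, and the sample data is made up. One real difference to keep in mind: Hive queries the files where they sit, whereas this sketch loads the rows into a database first.

```python
import csv
import io
import sqlite3

# A flat file as it might land in the warehouse (made-up sample data).
flat_file = io.StringIO(
    "campaign_id,event_type,revenue\n"
    "101,imp,0.002\n"
    "101,click,0.250\n"
    "102,imp,0.003\n"
)

db = sqlite3.connect(":memory:")  # stand-in for Hive on top of Hadoop
db.execute(
    "CREATE TABLE events (campaign_id INTEGER, event_type TEXT, revenue REAL)"
)

# Each CSV row becomes a dict keyed by the header names.
reader = csv.DictReader(flat_file)
db.executemany(
    "INSERT INTO events VALUES (:campaign_id, :event_type, :revenue)", reader
)

result = db.execute(
    "SELECT campaign_id, COUNT(*) FROM events "
    "GROUP BY campaign_id ORDER BY campaign_id"
).fetchall()
print(result)
```

The query itself is ordinary SQL, which is the appeal: analysts can work with the data without learning MapReduce.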
The Big Data Adoption Problem Adoption
Problems:
• People don’t know what to do with the data
or how to gain insights
• The data changes too fast for traditional
software development; they don’t know what
they want until they see it, they can’t see it
until they tell you what they want!
• If it can’t feel the benefits of the
infrastructure, the business can’t continue to
invest in big data
Solutions:
• Open up windows into the workflow so
humans can dig around and discover things
in the data, teach everyone SQL
• Provide simple BI and visualization solutions
that don’t require custom development
• Support the classical Excel part of the
business world, and make your data
accessible in tabular exports
• Continue development on custom reporting
platforms, learning from the first three steps
along the way
Fancy Stuff Adoption
If you’ve come this far you can finally have…
• Statistical modeling
• Sentiment analysis
• Prediction algorithms
• Machine learning
• Mmmmmmm fun stuff…
BUT NOT UNTIL YOU CAN SUPPORT IT!!!!
The Most Important Things to Know Cheatsheet
• It’s still all about the reads vs. writes
• HDFS is just a file system format for documents
• Hadoop is just for crunching and outputting into normal
databases, you don’t actually point an application at it
• MPPs are awesome and the wave of the future
• In-memory columnar databases are all the rage
(because they’re crazy fast) and will probably be a
requirement for all high-scale apps in the future
• Don’t forget to become awesome at Unix system &
network administration, because all the same commands
work in the cloud and it’s the only way to understand
what’s going on underneath the hood!
What to do right now Try it Out
• Download Openstack and install it on your laptop OR Register for
Amazon AWS
• In your new Cloud:
– Download & install Cloudera Community
– Spin up a few servers & add them to Cloudera
– Find Open Source Xtreme Data MPP in the Marketplace
• Get more advanced:
– Set up a Chef implementation, try automating a few server spin-
ups & spin-downs
– Try the open source Druid in-memory DB
– Set up a node.js server w/ Express and pipe in some real-time
data
– Write a real-time data analytics front-end to see if it works!
• Where to get help?
– Forums are your best friend!
– IRC is your worst enemy but it’s still there for you!
– Wikipedia, Youtube, etc all have great resources to learn.

Hofstra University - Overview of Big Data

  • 1. Big Data: Technologies & Challenges Facing Business Today A Practical Guide to Getting Started
  • 2. Hello, My Name is Sara Robertson. About Me • I’m the VP of Technology at CPX Interactive. • Previously ran the platform team at Warner Music Group. • I choose technology based on cost, efficiency, availability of talent, valuation potential, long term outlook, community support, and other reasons besides just pure tech. • I love agile, open source, bio-hacking, anime and martial arts movies, emo, audiobooks. • Favorite tech skills: optimization, debugging. About Me + Data • Started at a mainframe company doing Nagios-style server & network monitoring. • Spent my early years obsessed with Oracle, then Postgres, then MySQL. • Most of my data experience is in either high-traffic web applications or back-office data warehousing. • I believe that data is about humans as much as it is about technology, and if your solution doesn’t speak to your users then it’s not really a solution.
  • 4. Big Data We’ll Talk About Today (8/14/2014)
  • 5. Four Problems With Data Problems: Too Big, Too Fast, Too Disparate, Too Unwieldy. [Figure: two column listings shown with a run of column types (bigint, string, tinyint, smallint, float, int). First listing: system, creative_id, cpx_creative_name, viq_creative_name, alt_creative_id, alt_creative_name, size, file_size, click_track, audit_status, type, li_number, li_name, camp_number, camp_name, viq_placement_id, x_cslookup. Second listing, under the label hourlycpxDataSiphonHour.py: auction_id_64, date_time, user_tz_offset, width, height, media_type, fold_position, event_type, imp_type, payment_type, media_cost_dollars_cpm, revenue_type, buyer_spend, buyer_bid, ecp, eap, is_imp, is_learn, predict_type_rev, othuser_id_64, ip_address, ip_address_trunc, geo_country, geo_region, operating_system, browser, language, venue_id, seller_member_id, publisher_id, site_id, site_domain, tag_id, external_inv_id, reserve_price, seller_revenue_cpm, media_buy_rev_share_pct, pub_rule_id, seller_currency, publisher_currency, publisher_exchange_rate, serving_fees_cpm, serving_fees_revshare, buyer_member_id, advertiser_id, brand_id, advertiser_frequency, advertiser_recency, insertion_order_id, campaign_group_id, campaign_id, creative_id, creative_freq, creative_rec, cadence_modifier, can_convert, user_group_id, is_control, control_pct, control_creative_id, is_click, pixel_id, is_remarketing, post_click_conv, post_view_conv, post_click_revenue, post_view_revenue, order_id, external_data, pricing_type, booked_revenue_dollars, booked_revenue_adv_curr, commission_cpm, commission_revshare, auction_service_deduction, auction_service_fees, creative_overage_fees, clear_fees, buyer_currency, advertiser_currency, advertiser_exchange_rate, x_datasiphon. Comparison labels: Google’s qpms in 2000 vs. Google’s qpms in 2011; Old Table Layouts vs. New Table Layouts; Old File Formats vs. New File Formats; Old Pattern of Change in Data & Business Processes (Waterfall!) vs. New Pattern of Change, Agile!]
  • 6. Data Problems in Advertising Examples Our statistics at CPX… • 5.5+ billion impressions per day • 20+ billion bids per day • 45+ billion segments per day • 80+ columns per data stream • Columns average 25+ bytes That’s more than 100 Terabytes and 75 Billion records every day! * [Figure: Life of a Single Ad Impression, with stages Website, Ad Server, Exchanges, Bidders and data labels: Access Logs, Cookies, Interactions…; Predictions, Demographics, Prices…; Potential Buyers, Market Trends, Preferences…; Bid Parameters, Wins/Losses, Ceilings/Floors…; Creatives, Targeting, Analytics…; Bid Attempts, Imp Value, Market Demand; Revenue, Costs, Performance…; Profit, Winners, Demographics.]
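The totals on slide 6 can be sanity-checked with a few lines of arithmetic. The sketch below uses only the lower bounds quoted on the slide and assumes, for simplicity, that every stream is equally wide (80 columns of 25 bytes); that uniformity is an assumption of this example, not something the slide states.

```python
# Daily record counts quoted on the slide (lower bounds, hence the "+").
impressions = 5.5e9
bids = 20e9
segments = 45e9

columns_per_record = 80   # "80+ columns per data stream"
bytes_per_column = 25     # "Columns average 25+ bytes"

records_per_day = impressions + bids + segments
bytes_per_day = records_per_day * columns_per_record * bytes_per_column

# Decimal terabytes (10**12 bytes); assumes every stream is equally wide.
terabytes_per_day = bytes_per_day / 1e12

print(f"{records_per_day / 1e9:.1f} billion records/day")
print(f"{terabytes_per_day:.0f} TB/day")
```

With the lower bounds alone this comes to about 70.5 billion records and roughly 141 TB per day, which is consistent with the slide's "more than 100 Terabytes"; the "+" on each figure presumably accounts for the slide's rounder 75 billion.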
  • 7. Data Problems in Music Examples Our statistics at Warner… • 5+ million radio plays daily • 10+ million digital tracks sold daily • $1+ million in ecommerce daily • 20+ million fans online • 10-20 channels of interaction with every fan • Thousands of feeds of data that could potentially mention a band An essentially unlimited supply of new data streams with ever-changing data formats! 1:07:00pm On TV in New York: Michael Bublé appears on Oprah. 1:07:01pm Radio Plays in Seattle: MBUBLE HAVENT MET YOU YE; BUBLÉ, MICHAEL HAVEN’TMETYOUYET. 1:07:02pm On Twitter in Cyberspace: “OMG this song is so sick! <3 #mbuble #haventmet”; “This met you yet Bubblé song makes me sick.” 1:07:03pm On Website from China: 12 visitors to the website. • First match up the many different versions of the artist’s name • Then analyze sentiment to tell the difference between uses of “sick” • Then augment sparse data streams with useful dimensions (time, location) • Then decide how to correlate data!
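The first step on slide 7, matching up the different versions of an artist's name, might look something like the sketch below. It is a minimal illustration, not the approach used at Warner: the `ALIASES` table and the helper names are hypothetical, and the matching is exact after normalization, so a misspelling such as "Bubblé" would still need fuzzy matching on top.

```python
import re
import unicodedata

def normalize(text):
    """Uppercase, strip accents, and drop everything except A-Z and 0-9."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^A-Z0-9]", "", no_accents.upper())

# Hypothetical alias table: canonical artist -> spellings seen in feeds.
ALIASES = {
    "Michael Bublé": ["Michael Bublé", "BUBLÉ, MICHAEL", "MBUBLE", "#mbuble"],
}

# Reverse index from normalized spelling to canonical name.
LOOKUP = {normalize(alias): artist
          for artist, aliases in ALIASES.items()
          for alias in aliases}

def match_artist(raw_name):
    """Return the canonical artist for a raw feed value, or None."""
    return LOOKUP.get(normalize(raw_name))

print(match_artist("Buble, Michael"))
print(match_artist("#MBuble"))
```

Normalizing both sides before the lookup means the radio feed's uppercase, accent-free spelling and the Twitter hashtag resolve to the same canonical artist, which is what the later sentiment and correlation steps need.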
  • 8. Database Types (Unstructured to Structured) • Document Store / Distributed File System: used for unstructured fast-flowing data • Key-Value Pair: used for semi-structured fast-flowing data • Massively Parallel (MPPs): used for structured high-volume, high read + write data • Master/Slave: used for heavy read apps with normal data volume & scale requirements • In-Memory Columnar: used for super high speed read-only access to cacheable data. Document stores export into Parallel and Master databases, which cache into Columnar databases.
  • 9.
  • 10. What we tried in Music Case Study Roll-Your-Own approach • Python + RabbitMQ + MongoDB + PHP for custom BI layer • Custom development of workflow, transformation, storage, correlation, smoothing, analysis • Custom dev of dashboards, reports, charts, etc for the business Why it didn’t work • Bleeding edge technologies were too immature and cost of talent was too high • Outsourced dev + insourced support = fail • Too much overhead to get a usable product
  • 11. What we’re doing in Advertising Case Study Use-A-Stack approach • Leverage a kick-start with a stack that reduces implementation time, learning curve, and talent costs • Write pluggable modules • Build the plan for multi-layered data storage from the beginning Why it’s working • By keeping our investment and footprint light, we’re able to respond quickly to changes in the industry & technology ecosystem • The multiple layers of data are the key to building products at scale
  • 12. What does our stack look like? We only build the red stuff! [Diagram: layers Data Warehouse, hosting, web platform, products, Development, 3rd party integrations, R&D. Building blocks: Custom Modules, Custom Themes, Contrib Modules, Contrib Themes, Contrib Core, Custom Cores, Contrib Libraries, Glue Code, custom SQL, Contrib Services, Custom Code, custom Scripts, Paid Services, Contrib Tools. Technologies: Drupal, Wordpress, PHP, Javascript, jQuery, HTML, CSS, Flash, Bash, Perl, etc...; MySQL, PostgreSQL, Hadoop, Cloudera, Hive, Hue, Impala, Python, Java, SQL, etc...; Amazon, Ubuntu, Apache2, Nginx, Node.js, Memcache, Highwinds CDN, etc…; Appnexus, Right Media, Google, Salesforce, Zendesk, Microsoft, Chrome, Dropbox, etc...; Git, VSphere, VMware, Drush Make, MAMP, Confluence, Agile / Scrum, SOASTA, etc...; Mobile, Video, Bidders, IPs, Viewability, Emerging Tech...]
  • 13. Hadoop Distributed File System HDFS: Everybody’s Doing It – It’s just a file system! – Feed it gzips, csvs, whatever you’ve got – Command line + library interface to read/write files to it – Can be slow due to replication across network to data nodes – Not much different than sed/awk [Diagram: one Name Node and six Data Nodes]
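To make slide 13's "feed it gzips, csvs" and "command line interface" points concrete, here is a small sketch that writes a gzipped CSV locally and builds the `hdfs dfs -put` command that would copy it into HDFS. The target directory `/data/raw/`, the sample row, and the helper names are assumptions for illustration; the upload itself only runs if an `hdfs` client happens to be installed.

```python
import csv
import gzip
import os
import shutil
import subprocess
import tempfile

def write_gzipped_csv(path, header, rows):
    """Write rows to a gzip-compressed CSV file on the local disk."""
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

def hdfs_put_command(local_path, hdfs_dir):
    """Build the `hdfs dfs -put` command that copies a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]

def upload(local_path, hdfs_dir):
    """Run the copy if the hdfs client is installed; return True on success."""
    if shutil.which("hdfs") is None:
        return False  # no Hadoop client on this machine
    return subprocess.run(hdfs_put_command(local_path, hdfs_dir)).returncode == 0

# Hypothetical sample data and target directory.
demo_path = os.path.join(tempfile.mkdtemp(), "impressions.csv.gz")
write_gzipped_csv(demo_path, ["site_domain", "is_click"], [["example.com", 1]])
print(hdfs_put_command(demo_path, "/data/raw/"))
```

Keeping the command construction separate from the `subprocess` call makes the sketch easy to try on a laptop without a cluster, which fits the slide's point that HDFS is "just a file system" you copy files into.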
  • 14. Parallel DBs RDBMS The Holy Grail of Database Scalability Claims of database parallelization in the past have been greatly exaggerated. Nevertheless, I believe we may be at the dawn of a new era in this space. Paths to parallelization • Sharding: Manual split of tables into independent db instances. Joins across dbs not possible without manual extract and re-load into one instance. • Federation: Automatic split of tables into independent db instances. Joins across dbs managed by high-level software layer that extracts data and joins/merges outside the db instances. Performance penalty in data extraction/merge. Much redundant work performed by each db instance in parsing and compiling SQL. • True-MPP: Only one db instance, with multiple compute/storage nodes. All joins across nodes are managed natively by the execution engine within the db instance. No redundant work performed, no performance penalties. Technique (degree of automation; vendor; price): Sharding (Manual; Various…; Low), Federation (Semi-Manual; GreenPlum; High), True-MPP (Fully Auto; Netezza; High), True-MPP (Fully Auto; XtremeData; Low). * Thanks to Ravi, the CTO of XtremeData, for contributing this breakdown!
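The difference between sharding and the merge layer that federation automates (slide 14) is easier to see in code. The sketch below is a toy: in-memory SQLite databases stand in for independent db instances, the modulo routing rule and table layout are assumptions, and the merge step is hand-written application code of the kind a federation layer would provide for you.

```python
import sqlite3

NUM_SHARDS = 3

# Each in-memory SQLite database stands in for an independent db instance.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE impressions (campaign_id INTEGER, is_click INTEGER)")

def shard_for(campaign_id):
    """Manual routing rule: the application decides where each row lives."""
    return shards[campaign_id % NUM_SHARDS]

def insert(campaign_id, is_click):
    shard_for(campaign_id).execute(
        "INSERT INTO impressions VALUES (?, ?)", (campaign_id, is_click))

def clicks_per_campaign():
    """Scatter the same query to every shard, then merge outside the dbs."""
    totals = {}
    for db in shards:
        rows = db.execute(
            "SELECT campaign_id, SUM(is_click) FROM impressions GROUP BY campaign_id")
        for campaign_id, clicks in rows:
            totals[campaign_id] = totals.get(campaign_id, 0) + clicks
    return totals

for campaign_id, is_click in [(1, 1), (1, 0), (2, 1), (4, 1), (4, 1)]:
    insert(campaign_id, is_click)

print(clicks_per_campaign())
```

Every shard parses and runs the same SQL, and the results are merged outside the databases; those are the redundant work and extraction penalty the slide attributes to federation, and what a true MPP engine is described as handling natively.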
  • 15. Traditional DBs RDBMS MySQL Is Your Best Friend • Feeding it from a warehouse is the hardest part; needs workflow software and reduce jobs • Works great for read-heavy web applications • Cheap talent, cheap hosting, tons of support • Creativity required for heavy writes, e.g. node.js + queuing mechanism
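Slide 15's "queuing mechanism" for heavy writes can be sketched as follows. This is an illustration under assumptions: SQLite stands in for MySQL, an in-process `queue.Queue` stands in for a real message queue, and `BATCH_SIZE`, the table, and the sample rows are made up.

```python
import queue
import sqlite3

BATCH_SIZE = 3  # hypothetical; tune to the write volume

events = queue.Queue()
db = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
db.execute("CREATE TABLE clicks (creative_id INTEGER, site_domain TEXT)")

def flush(max_rows=BATCH_SIZE):
    """Drain up to max_rows queued events and write them in one transaction."""
    batch = []
    while len(batch) < max_rows:
        try:
            batch.append(events.get_nowait())
        except queue.Empty:
            break
    if batch:
        with db:  # commits on success, rolls back on error
            db.executemany("INSERT INTO clicks VALUES (?, ?)", batch)
    return len(batch)

# The web tier only enqueues, which is cheap ...
for creative_id in range(5):
    events.put((creative_id, "example.com"))

# ... and a background worker would call flush() in a loop.
written = [flush(), flush(), flush()]
print(written)
```

The design choice is to trade a little latency for far fewer transactions: the database sees a handful of batched inserts instead of one write per request.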
  • 16. In-Memory Columnar DBs In-Memory The New Memcached • In-memory DBs or “Columnar” databases are just key-value pairs: put(‘name’, ‘value’) • Some sophisticated layers have been built on top to turn it into near-SQL • Crazy fast solution for read-heavy systems like analytics • Still needs workflow, management, and a traditional backend storage system
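The `put('name', 'value')` model on slide 16, together with its reminder that a backend storage system is still needed, can be sketched as a read-through cache. Everything here is hypothetical scaffolding: the `KeyValueCache` class, the time-to-live default, and the `slow_backend` function are illustrations, not the API of any particular in-memory product.

```python
import time

class KeyValueCache:
    """A dict-backed store with put/get and a per-entry time-to-live."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl_seconds = ttl_seconds
        self._data = {}  # key -> (expiry timestamp, value)

    def put(self, key, value):
        self._data[key] = (time.monotonic() + self.ttl_seconds, value)

    def get(self, key, load=None):
        """Return the cached value; on a miss, call load(key) and cache it."""
        entry = self._data.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]
        if load is None:
            return None
        value = load(key)
        self.put(key, value)
        return value

backend_calls = []

def slow_backend(key):
    """Hypothetical stand-in for a query against the backend storage system."""
    backend_calls.append(key)
    return f"value-for-{key}"

cache = KeyValueCache()
cache.put("name", "value")
first = cache.get("report:daily", load=slow_backend)
second = cache.get("report:daily", load=slow_backend)
print(first, second, backend_calls)
```

The second read is served from memory and never touches the backend, which is the read-heavy speedup the slide is describing.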
  • 17. Big Data Distribution Requirements Hosting Massive Cheap Infrastructure • Crazy virtual server farms! 1000+ servers get created and destroyed to perform 1 job • Automation and deployment of these servers is crucial; infrastructure automation is the new hot skill • Small-to-medium systems or growing products use cloud first and only invest in metal once stabilized, and even then it’s rarely cost effective • Connections between the servers drive the performance of your data warehouse solution! Life in the Cloud: It’s so different. Forget everything you thought you knew. Except Unix. Major Bottlenecks: • Reading & writing to disk: disks are usually network-connected to cloud-based servers • Communicating with other servers during replication; need to shave off milliseconds with optimizations • Staying ahead of storage space limitations with archive jobs • Partitioning large datasets based on primary reduce filters • Keeping up with your dataset when you start to get behind
  • 18. Coordination Strategies Implementation Amazon Hosted Cloud • It’s like a treasure hunt.* vSphere/OpenStack Private Cloud • Roll your own someday. Cloudera Hadoop Management • You will love life. Chef Deployment Automation • OMG life gets better. * Note: Rackspace is also awesome.
  • 19. Workflow Strategies Implementation You Still Need Data In & Out • Hive/Pig – You definitely need them – SQL sits on top of Hadoop so you can query flat files like a table! – Output into an RDBMS is easy, but managing the jobs is hard – Nobody wants to learn MapReduce • Custom Coding – Long term supportability is low – High cost & slow to market • Cloudera has Workflow Services! – Impala – Flume No one is really the standard in this space yet, although there are a lot of really interesting players. Check out the Big Data chart for more!
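Slide 19's remark that nobody wants to learn MapReduce is easier to appreciate next to the one-line SQL that replaces it. The sketch below is plain Python, not Hadoop: it walks through map, shuffle, and reduce over a few hypothetical rows and shows, in a comment, the kind of GROUP BY query that a SQL layer such as Hive lets you write instead. The `impressions` table name and the sample rows are assumptions.

```python
from collections import defaultdict

# Hypothetical rows from a flat file: (site_domain, is_click).
rows = [
    ("example.com", 1),
    ("example.org", 0),
    ("example.com", 0),
    ("example.com", 1),
]

def map_phase(row):
    """Emit one (key, value) pair per input row."""
    site_domain, is_click = row
    return site_domain, is_click

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse each key's values into one output row."""
    return key, sum(values)

clicks = dict(reduce_phase(k, v) for k, v in shuffle(map(map_phase, rows)).items())

# The same result expressed as a SQL query over the flat file:
#   SELECT site_domain, SUM(is_click) FROM impressions GROUP BY site_domain;
print(clicks)
```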
  • 20. The Big Data Adoption Problem Adoption Problems: • People don’t know what to do with the data or how to gain insights • The data changes too fast for traditional software development; they don’t know what they want until they see it, they can’t see it until they tell you what they want! • If it can’t feel the benefits of the infrastructure, the business can’t continue to invest in big data Solutions: • Open up windows into the workflow so humans can dig around and discover things in the data, teach everyone SQL • Provide simple BI and visualization solutions that don’t require custom development • Support the classical Excel part of the business world, and make your data accessible in tabular exports • Continue development on custom reporting platforms, learning from the first three steps along the way
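For slide 20's suggestion to support the Excel side of the business with tabular exports, a query-to-CSV helper is often all it takes. This is a minimal sketch: SQLite stands in for whatever reporting database is in use, and the `daily` table, its columns, and the sample figures are made up for the example.

```python
import csv
import io
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the reporting database
db.execute("CREATE TABLE daily (site_domain TEXT, impressions INTEGER)")
db.executemany("INSERT INTO daily VALUES (?, ?)",
               [("example.com", 1200), ("example.org", 300)])

def export_csv(connection, sql):
    """Run a query and return the result as CSV text with a header row."""
    cursor = connection.execute(sql)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return out.getvalue()

report = export_csv(db, "SELECT site_domain, impressions FROM daily ORDER BY site_domain")
print(report)
```

The resulting text opens directly in a spreadsheet, so analysts can explore the data without waiting on custom report development.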
  • 21. Fancy Stuff Adoption If you’ve come this far you can finally have… • Statistical modeling • Sentiment analysis • Prediction algorithms • Machine learning • Mmmmmmm fun stuff… BUT NOT UNTIL YOU CAN SUPPORT IT!!!! 
  • 22. The Most Important Things to Know Cheatsheet • It’s still all about the reads vs. writes • HDFS is just a file system format for documents • Hadoop is just for crunching and outputting into normal databases, you don’t actually point an application at it • MPPs are awesome and the wave of the future • In-memory columnar databases are all the rage (because they’re crazy fast) and will probably be a requirement for all high-scale apps in the future • Don’t forget to become awesome at Unix system & network administration, because all the same commands work in the cloud and it’s the only way to understand what’s going on underneath the hood!
  • 23. What to do right now Try it Out • Download OpenStack and install it on your laptop OR register for Amazon AWS • In your new Cloud: – Download & install Cloudera Community – Spin up a few servers & add them to Cloudera – Find Open Source XtremeData MPP in the Marketplace • Get more advanced: – Set up a Chef implementation, try automating a few server spin-ups & spin-downs – Try the open source Druid in-memory DB – Set up a node.js server w/ Express and pipe in some real-time data – Write a real-time data analytics front-end to see if it works! • Where to get help? – Forums are your best friend! – IRC is your worst enemy but it’s still there for you! – Wikipedia, YouTube, etc. all have great resources to learn.

Editor's Notes

  1. Would Michael Bublé stepping on the stage to an East Coast audience inspire 12 visitors to the website from China 3 seconds later?
  2. It’s the same as a decision to use a web framework instead of writing a new session handler over and over… who wants to write workflow & job management solutions from scratch?