
Beyond a Big Data Pilot: Building a Production Data Infrastructure - StampedeCon 2014

This document discusses building a production data infrastructure beyond a big data pilot project. It examines the data value chain from data acquisition to analytics. The key components discussed include data acquisition, ingestion, storage, data services, analytics, and data management. Various options for these components are explored, with considerations for batch, interactive and real-time workloads. The goal is to provide a framework for understanding the options and making choices to support different use cases at scale in a production environment.

© 2014 Silicon Valley Data Science LLC
All Rights Reserved.
@SVDataScience
Beyond a Big Data Pilot:
Building a Production Data Infrastructure
StampedeCon
29 May 2014, St. Louis
Stephen O’Sullivan (@steveos)
strata.svds.com @SVDataScience
Stephen O’Sullivan
Distinguished Architect
Beyond a Big Data Pilot:
Building a Production Data Infrastructure
Creating a data architecture involves many moving parts. By
examining the data value chain, from ingestion through to analytics,
we will explain how the various parts of the Hadoop and big data
ecosystem fit together to support batch, interactive and real-time
analytical workloads.
By tracing the flow of data from source to output, we’ll explore the
options and considerations for components, including data
acquisition, ingestion, storage, data services, analytics and data
management. Most importantly, we’ll leave you with a framework for
understanding these options and making choices.
Key-Value
Columnar
Graph
Document
GENERAL
UP OR OUT?
Different use cases put different demands on the data infrastructure
• UC1
• UC2
• UC3
• UC4
• UCn
Increasing cost per unit of
capability from scale-up
architectures causes rationing of
resources. Only the most valuable
use cases are pursued.
[Chart labels: Data Resource Usage; Value; scale-out cost; UC1, UC2, UC3, UC4, UCn]
THE DATA VALUE CHAIN
Acquire → Ingest → Process → Persist → Integrate → Analyze → Expose
BUILDING A DATA PLATFORM
[Diagram components: External Systems; Internal Data Sources; External Data Sources; Data Acquisition; Data Ingestion; Offline Processing; Real Time Processing; Batch Processing; Data Repository; Persistence; Data Services; Analytics; Data Management (Security, Operations, Data Quality, Meta Data Management and Data Lineage)]
Acquisition:
from internal and external data sources
Ingestion
offline and real-time Processing
Persistence
Data Services
Exposing data to applications
Analytics
batch and real-time processing
Data Management
Data security, operations, lineage, quality, and metadata management
Use Case
• Collect in-store sales transactions in near real-time
• Provide near real-time dashboards of sales transactions (roll up by store, region, etc.)
• Provide ad-hoc access to this data as soon as it's collected (i.e., low latency and fine grain)
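The roll-up behind such a dashboard can be sketched in a few lines. This is a minimal illustration only, assuming each transaction is a record with hypothetical `store`, `region`, and `amount` fields; the deck does not specify a schema or a processing engine.

```python
from collections import defaultdict
from decimal import Decimal


def roll_up(transactions):
    """Sum sale amounts per (region, store) and per region.

    The field names 'store', 'region' and 'amount' are assumptions
    made for this sketch, not taken from the deck.
    """
    by_store = defaultdict(Decimal)   # Decimal() starts each total at 0
    by_region = defaultdict(Decimal)
    for tx in transactions:
        amount = Decimal(tx["amount"])  # parse from string to avoid float error
        by_store[(tx["region"], tx["store"])] += amount
        by_region[tx["region"]] += amount
    return dict(by_store), dict(by_region)


# Hypothetical sample data.
sample = [
    {"store": "S1", "region": "west", "amount": "10.50"},
    {"store": "S1", "region": "west", "amount": "4.50"},
    {"store": "S2", "region": "east", "amount": "7.00"},
]
by_store, by_region = roll_up(sample)
```

Keeping the fine-grained records alongside the totals is what allows both the dashboard view and the ad-hoc access the use case asks for.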
FORTUNE 500 RETAIL COMPANY
Enabling Near Real-time Sales Transactions
[Diagram labels: Application Servers; Data Center A; Data Center B; BI Server (http), shown twice; the next build of the slide adds CFS, also shown twice]
Ready to go into Production?
• Data Acquisition
– Can you see the “collectors” (internal or external)?
   Make sure you have the correct network access in place.
– Do you need to encrypt the data (internally or externally)?
   Will depend on the data and your policies.
• Data Ingestion
– Can you handle the traffic to the “collectors”?
   Make sure the solution you choose can scale out. Apache Flume is a good example of this.
– Do you have redundant, self-healing paths into the cluster?
   Make sure you're not point to point. In Flume, Storm, and Kafka you can configure forks, etc., but you may need to handle duplicate data.
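Handling the duplicate data that forked or redundant paths can produce might look like the sketch below. It assumes every event carries a unique `id` field (a hypothetical name) and keeps a bounded in-memory record of recently seen ids; a production pipeline would more likely keep this state in a shared store.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recently seen event ids, up to max_ids."""

    def __init__(self, max_ids=100_000):
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def is_new(self, event_id):
        """Return True the first time an id is seen, False for repeats."""
        if event_id in self._seen:
            self._seen.move_to_end(event_id)  # refresh recency of a repeat
            return False
        self._seen[event_id] = True
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)    # evict the oldest id
        return True


def drop_duplicates(events, dedup):
    """Keep only events whose 'id' (assumed field) has not been seen."""
    return [event for event in events if dedup.is_new(event["id"])]
```

The bound on remembered ids is a trade-off: a duplicate that arrives after its id has been evicted will pass through, so the window should be sized to the longest expected redelivery delay.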
Ready to go into Production?
• Data Repository
– Can you handle out of order data?
   Make sure you have a way to address it, as it will happen.
– Can you scale the cluster for data volume spikes and/or processing spikes?
   Hadoop and Cassandra make it very easy to add nodes. If you cannot add nodes, be prepared to drop data or stop processes or both.
– Should I just store plain text (compressed)?
   If it's very wide data, and you query a subset of the columns, Parquet would be a good choice. If you would like to be able to version your data schema, Avro is a good choice.
• Data Services
– Do applications need to access this data?
   Build a RESTful service to access the data.
– Do you have data resiliency?
   What is data resiliency, I hear you ask...
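One common way to address out of order data is to file each record by the time the event happened rather than the time it arrived. The sketch below assumes each event carries an `event_time` datetime (a hypothetical field name) and uses hourly buckets, which is an arbitrary choice for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(event_time):
    """Truncate a datetime to the start of its hour."""
    return event_time.replace(minute=0, second=0, microsecond=0)


def bucket_by_event_time(events):
    """Group events into hourly buckets keyed by event time.

    Arrival order does not matter: a late event still lands in the
    bucket for the hour in which it occurred, and each bucket is
    sorted by event time before it is returned.
    """
    buckets = defaultdict(list)
    for event in events:
        buckets[hour_bucket(event["event_time"])].append(event)
    for bucket in buckets.values():
        bucket.sort(key=lambda e: e["event_time"])
    return dict(buckets)
```

In a stored data set the same idea usually shows up as partitioning by event date or hour, so that a late record is written to the partition it belongs to.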
DATA RESILIENCY
Stovepipe: One-to-one relationship from data source to product
Hard Failure: If the data source is broken, so is the app.
Multi-sourced: Redundancy of overlapping data sources makes your products more resilient
Graceful Degradation: If a data source breaks, there is a backup and your app continues to function
Production data services abstract the probabilistic integration of overlapping data sources. We call this model a Data Mesh:
[Diagram labels: Products; Data Sources; Broken Data Sources; Data Services]
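Graceful degradation can be illustrated with a data service that tries overlapping sources in turn. This is a deliberately simple ordered fallback, not the probabilistic integration the slide describes; the source names and fetch functions are hypothetical.

```python
def fetch_with_fallback(sources, key):
    """Return (source_name, value) from the first source that answers.

    'sources' is an ordered list of (name, fetch_function) pairs.
    A source that raises is treated as broken and the next one is
    tried; LookupError is raised only if every source fails.
    """
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure counts as a broken source
            errors.append((name, exc))
    raise LookupError(f"all sources failed for {key!r}: {errors}")
```

With a single source in the list this behaves like the stovepipe case: when that source breaks, so does the caller.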
Ready to go into Production?
• Analytics
– Which is the right SQL on Hadoop solution (for me)?
   There are a few to choose from: Hive, Impala, Spark SQL, and HAWQ (and growing). They share the same metastore. Some are faster than others (depends on the type of query).
– Which BI tool should I use?
   See if the current tool works with your distro. You can also look at Platfora, Datameer, and Karmasphere.
– Do I still need to set up business views of the data?
   Yes you do, but the benefit is you still have access to the raw data for the advanced data analyst or data scientist.
– What about deep analytics?
   Now you have a data lake, you can take advantage of doing deep analytics on the data without moving it out.
Analytics tools
Ready to go into Production?
• Data Management
– Security (who can see what?)
   It’s getting there. At the query level, using Hive or Impala, you can use Apache Sentry or Apache Knox. There are other 3rd party tools like Dataguise that let you do things like encryption at rest, or masking.
– Can you meet your SLA when other jobs / queries are running?
   Using “Fair Scheduler” will help you manage your jobs’ SLAs. A 3rd party product by Pepper Data can help with this too (and a little more).
– What monitoring do you have in place?
– Cluster failover?
Adding more use cases
• Am I duplicating data?
• Can I reuse the infrastructure I’ve already created?
• Do I have enough room in the cluster (space/processing)?
• Will I impact the SLAs of jobs/queries currently running?
FORTUNE 500 RETAIL COMPANY
Enabling Real Time Database Monitoring
HIGH LEVEL ARCHITECTURE
[Diagram labels: Oracle Stats Collection (shown three times); Pulling data over JDBC; Sending data to Graphite; Writing data to HDFS]
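The "sending data to Graphite" step can be sketched with Graphite's plaintext protocol, in which each metric is one line of the form `path value timestamp`, normally sent to the Carbon listener on port 2003. The metric path and host used here are hypothetical, and the deck does not say how the collector was actually implemented.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(lines, host, port=2003):
    """Send already formatted metric lines to a Carbon plaintext listener."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical metric name for a statistic pulled from Oracle.
line = graphite_line("oracle.db1.active_sessions", 42, 1401321600)
```

Calling `send_to_graphite([line], "graphite.example.internal")` would deliver the sample; the host name is a placeholder, so the call is left out of the block.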
FORTUNE 500 RETAIL COMPANY
Enabling Log Collection & Search
[Diagram labels: Application Servers; statsd; http; Log Search (http)]
questions
Yes, We’re Hiring
svds.com/join-us
THANK YOU
Stephen O’Sullivan @steveos
27

Recommended

Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti
 
Is your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaIs your big data journey stalling? Take the Leap with Capgemini and Cloudera
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaCloudera, Inc.
 
6 enriching your data warehouse with big data and hadoop
6 enriching your data warehouse with big data and hadoop6 enriching your data warehouse with big data and hadoop
6 enriching your data warehouse with big data and hadoopDr. Wilfred Lin (Ph.D.)
 
The curious case of data lake redemption
The curious case of data lake redemptionThe curious case of data lake redemption
The curious case of data lake redemptionDataWorks Summit
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubCloudera, Inc.
 
Data science workshop
Data science workshopData science workshop
Data science workshopHortonworks
 
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata Hortonworks
 

More Related Content

What's hot

Actian forrester- hortonworks
Actian   forrester- hortonworksActian   forrester- hortonworks
Actian forrester- hortonworksHortonworks
 
Big Data Discovery
Big Data DiscoveryBig Data Discovery
Big Data DiscoveryHarald Erb
 
Developing a Strategy for Data Lake Governance
Developing a Strategy for Data Lake GovernanceDeveloping a Strategy for Data Lake Governance
Developing a Strategy for Data Lake GovernanceTony Baer
 
Better Together: The New Data Management Orchestra
Better Together: The New Data Management OrchestraBetter Together: The New Data Management Orchestra
Better Together: The New Data Management OrchestraCloudera, Inc.
 
Harnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyHarnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyDataWorks Summit
 
Hadoop and Data Virtualization - A Case Study by VHA
Hadoop and Data Virtualization - A Case Study by VHAHadoop and Data Virtualization - A Case Study by VHA
Hadoop and Data Virtualization - A Case Study by VHAHortonworks
 
ING's Customer-Centric Data Journey from Community Idea to Private Cloud Depl...
ING's Customer-Centric Data Journey from Community Idea to Private Cloud Depl...ING's Customer-Centric Data Journey from Community Idea to Private Cloud Depl...
ING's Customer-Centric Data Journey from Community Idea to Private Cloud Depl...DataWorks Summit/Hadoop Summit
 
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike FergusonMapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike FergusonMapR Technologies
 
Leverage Big Data to Enhance Customer Experience in Telecommunications – with...
Leverage Big Data to Enhance Customer Experience in Telecommunications – with...Leverage Big Data to Enhance Customer Experience in Telecommunications – with...
Leverage Big Data to Enhance Customer Experience in Telecommunications – with...Hortonworks
 
IDC Retail Insights - What's Possible with a Modern Data Architecture?
IDC Retail Insights - What's Possible with a Modern Data Architecture?IDC Retail Insights - What's Possible with a Modern Data Architecture?
IDC Retail Insights - What's Possible with a Modern Data Architecture?Hortonworks
 
Govern This! Data Discovery and the application of data governance with new s...
Govern This! Data Discovery and the application of data governance with new s...Govern This! Data Discovery and the application of data governance with new s...
Govern This! Data Discovery and the application of data governance with new s...Cloudera, Inc.
 
Big data and its impact on SOA
Big data and its impact on SOABig data and its impact on SOA
Big data and its impact on SOADemed L'Her
 
Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25Hortonworks
 
Oil and gas big data edition
Oil and gas  big data editionOil and gas  big data edition
Oil and gas big data editionMark Kerzner
 
Data Governance, Compliance and Security in Hadoop with Cloudera
Data Governance, Compliance and Security in Hadoop with ClouderaData Governance, Compliance and Security in Hadoop with Cloudera
Data Governance, Compliance and Security in Hadoop with ClouderaCaserta
 
Big Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San JoseBig Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San JoseJeffrey T. Pollock
 
Hortonworks Hybrid Cloud - Putting you back in control of your data
Hortonworks Hybrid Cloud - Putting you back in control of your dataHortonworks Hybrid Cloud - Putting you back in control of your data
Hortonworks Hybrid Cloud - Putting you back in control of your dataScott Clinton
 
8 from zero to insight with real time big data
8 from zero to insight with real time big data8 from zero to insight with real time big data
8 from zero to insight with real time big dataDr. Wilfred Lin (Ph.D.)
 
A Modern Data Strategy for Precision Medicine
A Modern Data Strategy for Precision MedicineA Modern Data Strategy for Precision Medicine
A Modern Data Strategy for Precision MedicineCloudera, Inc.
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Cloudera, Inc.
 

What's hot (20)

Actian forrester- hortonworks
Actian   forrester- hortonworksActian   forrester- hortonworks
Actian forrester- hortonworks
 
Big Data Discovery
Big Data DiscoveryBig Data Discovery
Big Data Discovery
 
Developing a Strategy for Data Lake Governance
Developing a Strategy for Data Lake GovernanceDeveloping a Strategy for Data Lake Governance
Developing a Strategy for Data Lake Governance
 
Better Together: The New Data Management Orchestra
Better Together: The New Data Management OrchestraBetter Together: The New Data Management Orchestra
Better Together: The New Data Management Orchestra
 
Harnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyHarnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case Study
 
Hadoop and Data Virtualization - A Case Study by VHA
Hadoop and Data Virtualization - A Case Study by VHAHadoop and Data Virtualization - A Case Study by VHA
Hadoop and Data Virtualization - A Case Study by VHA
 
ING's Customer-Centric Data Journey from Community Idea to Private Cloud Depl...
ING's Customer-Centric Data Journey from Community Idea to Private Cloud Depl...ING's Customer-Centric Data Journey from Community Idea to Private Cloud Depl...
ING's Customer-Centric Data Journey from Community Idea to Private Cloud Depl...
 
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike FergusonMapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
 
Leverage Big Data to Enhance Customer Experience in Telecommunications – with...
Leverage Big Data to Enhance Customer Experience in Telecommunications – with...Leverage Big Data to Enhance Customer Experience in Telecommunications – with...
Leverage Big Data to Enhance Customer Experience in Telecommunications – with...
 
IDC Retail Insights - What's Possible with a Modern Data Architecture?
IDC Retail Insights - What's Possible with a Modern Data Architecture?IDC Retail Insights - What's Possible with a Modern Data Architecture?
IDC Retail Insights - What's Possible with a Modern Data Architecture?
 
Govern This! Data Discovery and the application of data governance with new s...
Govern This! Data Discovery and the application of data governance with new s...Govern This! Data Discovery and the application of data governance with new s...
Govern This! Data Discovery and the application of data governance with new s...
 
Big data and its impact on SOA
Big data and its impact on SOABig data and its impact on SOA
Big data and its impact on SOA
 
Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25
 
Oil and gas big data edition
Oil and gas  big data editionOil and gas  big data edition
Oil and gas big data edition
 
Data Governance, Compliance and Security in Hadoop with Cloudera
Data Governance, Compliance and Security in Hadoop with ClouderaData Governance, Compliance and Security in Hadoop with Cloudera
Data Governance, Compliance and Security in Hadoop with Cloudera
 
Big Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San JoseBig Data at Oracle - Strata 2015 San Jose
Big Data at Oracle - Strata 2015 San Jose
 
Hortonworks Hybrid Cloud - Putting you back in control of your data
Hortonworks Hybrid Cloud - Putting you back in control of your dataHortonworks Hybrid Cloud - Putting you back in control of your data
Hortonworks Hybrid Cloud - Putting you back in control of your data
 
8 from zero to insight with real time big data
8 from zero to insight with real time big data8 from zero to insight with real time big data
8 from zero to insight with real time big data
 
A Modern Data Strategy for Precision Medicine
A Modern Data Strategy for Precision MedicineA Modern Data Strategy for Precision Medicine
A Modern Data Strategy for Precision Medicine
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
 

Viewers also liked

Big Data: It’s all about the Use Cases
Big Data: It’s all about the Use CasesBig Data: It’s all about the Use Cases
Big Data: It’s all about the Use CasesJames Serra
 
Café da manhã - São Paulo - Use-cases and opportunities in BigData with Hadoop
Café da manhã - São Paulo - Use-cases and opportunities in BigData with HadoopCafé da manhã - São Paulo - Use-cases and opportunities in BigData with Hadoop
Café da manhã - São Paulo - Use-cases and opportunities in BigData with HadoopOCTO Technology
 
Big Data Asset Maturity Model
Big Data Asset Maturity ModelBig Data Asset Maturity Model
Big Data Asset Maturity Modelnoahwong
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureUwe Printz
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Casesboorad
 
IT Operating Model
IT Operating ModelIT Operating Model
IT Operating Modelanusharaju38
 
How to create new business models with Big Data and Analytics
How to create new business models with Big Data and AnalyticsHow to create new business models with Big Data and Analytics
How to create new business models with Big Data and AnalyticsAki Balogh
 
How to Build & Sustain a Data Governance Operating Model
How to Build & Sustain a Data Governance Operating Model How to Build & Sustain a Data Governance Operating Model
How to Build & Sustain a Data Governance Operating Model DATUM LLC
 

Viewers also liked (13)

Big Data: It’s all about the Use Cases
Big Data: It’s all about the Use CasesBig Data: It’s all about the Use Cases
Big Data: It’s all about the Use Cases
 
Café da manhã - São Paulo - Use-cases and opportunities in BigData with Hadoop
Café da manhã - São Paulo - Use-cases and opportunities in BigData with HadoopCafé da manhã - São Paulo - Use-cases and opportunities in BigData with Hadoop
Café da manhã - São Paulo - Use-cases and opportunities in BigData with Hadoop
 
Big Data Asset Maturity Model
Big Data Asset Maturity ModelBig Data Asset Maturity Model
Big Data Asset Maturity Model
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Unicom Big Data Conference
Unicom  Big Data ConferenceUnicom  Big Data Conference
Unicom Big Data Conference
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
IT Operating Model
IT Operating ModelIT Operating Model
IT Operating Model
 
How to create new business models with Big Data and Analytics
How to create new business models with Big Data and AnalyticsHow to create new business models with Big Data and Analytics
How to create new business models with Big Data and Analytics
 
8 Steps to Creating a Data Strategy
8 Steps to Creating a Data Strategy8 Steps to Creating a Data Strategy
8 Steps to Creating a Data Strategy
 
How to Build & Sustain a Data Governance Operating Model
How to Build & Sustain a Data Governance Operating Model How to Build & Sustain a Data Governance Operating Model
How to Build & Sustain a Data Governance Operating Model
 

Similar to Beyond a Big Data Pilot: Building a Production Data Infrastructure - StampedeCon 2014

Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupBig Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupScott Mitchell
 
The New Frontier: Optimizing Big Data Exploration
The New Frontier: Optimizing Big Data ExplorationThe New Frontier: Optimizing Big Data Exploration
The New Frontier: Optimizing Big Data ExplorationInside Analysis
 
Sqrrl March Webinar: How to Build a Big App
Sqrrl March Webinar: How to Build a Big AppSqrrl March Webinar: How to Build a Big App
Sqrrl March Webinar: How to Build a Big AppSqrrl
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubCloudera, Inc.
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database RoundtableEric Kavanagh
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success DataWorks Summit/Hadoop Summit
 
Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)Jeffrey T. Pollock
 
Data Governance for Data Lakes
Data Governance for Data LakesData Governance for Data Lakes
Data Governance for Data LakesKiran Kamreddy
 
MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...
MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...
MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...MongoDB
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopCloudera, Inc.
 
Hadoop as an Analytic Platform: Why Not?
Hadoop as an Analytic Platform: Why Not?Hadoop as an Analytic Platform: Why Not?
Hadoop as an Analytic Platform: Why Not?Inside Analysis
 
How to Optimize Sales Analytics Using 10x the Data at 1/10th the Cost
How to Optimize Sales Analytics Using 10x the Data at 1/10th the CostHow to Optimize Sales Analytics Using 10x the Data at 1/10th the Cost
How to Optimize Sales Analytics Using 10x the Data at 1/10th the CostAtScale
 
Cloudian 451-hortonworks - webinar
Cloudian 451-hortonworks - webinarCloudian 451-hortonworks - webinar
Cloudian 451-hortonworks - webinarHortonworks
 
Oracle Big Data Governance Webcast Charts
Oracle Big Data Governance Webcast ChartsOracle Big Data Governance Webcast Charts
Oracle Big Data Governance Webcast ChartsJeffrey T. Pollock
 
Apache NiFi Toronto Meetup
Apache NiFi Toronto MeetupApache NiFi Toronto Meetup
Apache NiFi Toronto MeetupHortonworks
 
Level Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop AccelerationLevel Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop AccelerationInside Analysis
 
Extending BI with Big Data Analytics
Extending BI with Big Data AnalyticsExtending BI with Big Data Analytics
Extending BI with Big Data AnalyticsDatameer
 
4. Big data & analytics HP
4. Big data & analytics HP4. Big data & analytics HP
4. Big data & analytics HPMITEF México
 
BAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneyBAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneySai Paravastu
 

Similar to Beyond a Big Data Pilot: Building a Production Data Infrastructure - StampedeCon 2014 (20)

Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User GroupBig Data and BI Tools - BI Reporting for Bay Area Startups User Group
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group
 
Ask bigger questions
Ask bigger questionsAsk bigger questions
Ask bigger questions
 
The New Frontier: Optimizing Big Data Exploration
Beyond a Big Data Pilot: Building a Production Data Infrastructure - StampedeCon 2014

  • 1. Beyond a Big Data Pilot: Building a Production Data Infrastructure. StampedeCon, 29 May 2014, St. Louis. Stephen O’Sullivan (@steveos). strata.svds.com, @SVDataScience. © 2014 Silicon Valley Data Science LLC, All Rights Reserved.
  • 2. Stephen O’Sullivan, Distinguished Architect
  • 3. Beyond a Big Data Pilot: Building a Production Data Infrastructure. Creating a data architecture involves many moving parts. By examining the data value chain, from ingestion through to analytics, we will explain how the various parts of the Hadoop and big data ecosystem fit together to support batch, interactive and realtime analytical workloads. By tracing the flow of data from source to output, we’ll explore the options and considerations for components, including data acquisition, ingestion, storage, data services, analytics and data management. Most importantly, we’ll leave you with a framework for understanding these options and making choices.
  • 4. Key-Value, Columnar, Graph, Document, GENERAL
  • 5. UP OR OUT? Different use cases (UC1, UC2, UC3, UC4, UCn) put different demands on the data infrastructure. Increasing cost per unit of capability from scale-up architectures causes rationing of resources. Only the most valuable use cases are pursued. [Chart labels: Data Resource Usage, Value, scale-out cost; use cases UC1 to UCn]
  • 6. THE DATA VALUE CHAIN: Acquire → Ingest → Process → Persist → Integrate → Analyze → Expose
  • 7. BUILDING A DATA PLATFORM. [Diagram labels: External Systems; Internal Data Sources; External Data Sources; Data Acquisition; Data Ingestion; Offline Processing; Real Time Processing; Batch Processing; Data Repository; Persistence; Data Services; Analytics; Data Management (Security, Operations, Data Quality, Meta Data Management and Data Lineage)]
  • 8. Acquisition: from internal and external data sources [platform diagram as on slide 7]
  • 9. Ingestion: offline and real-time processing [platform diagram as on slide 7]
  • 10. Persistence [platform diagram as on slide 7]
  • 11. Data Services: exposing data to applications [platform diagram as on slide 7]
  • 12. Analytics: batch and real-time processing [platform diagram as on slide 7]
  • 13. Data Management: data security, operations, lineage, quality, and metadata management [platform diagram as on slide 7]
  • 14. Use Case
    • Collect in-store sales transactions in near real-time
    • Provide near real-time dashboards of sales transactions (rolled up by store, region, etc.)
    • Provide ad-hoc access to this data as soon as it’s collected (i.e. low latency and fine-grained)
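The roll-up named in this use case can be illustrated with a small sketch. The transaction fields (`store`, `region`, `amount`), the sample data, and the function name are assumptions made for illustration; the slides do not describe the actual schema or implementation.

```python
from collections import defaultdict


def roll_up(transactions, key):
    """Sum transaction amounts grouped by one field, e.g. 'store' or 'region'."""
    totals = defaultdict(float)
    for tx in transactions:
        totals[tx[key]] += tx["amount"]
    return dict(totals)


# Hypothetical sample transactions (not from the talk).
sales = [
    {"store": "S1", "region": "West", "amount": 10.0},
    {"store": "S2", "region": "West", "amount": 5.5},
    {"store": "S3", "region": "East", "amount": 7.0},
]

by_region = roll_up(sales, "region")
by_store = roll_up(sales, "store")
```

In a near real-time setting the same aggregation would typically run incrementally over small batches or a stream rather than over a full list in memory.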
  • 15. FORTUNE 500 RETAIL COMPANY: Enabling Near Real-time Sales Transactions. [Diagram labels: APPLICATION SERVERS; DATA CENTER A; DATA CENTER B; BI Server; http]
  • 16. FORTUNE 500 RETAIL COMPANY: Enabling Near Real-time Sales Transactions. [Diagram labels: APPLICATION SERVERS; DATA CENTER A; DATA CENTER B; CFS; BI Server; http]
  • 17. Ready to go into Production?
    • Data Acquisition
      – Can you see the “collectors” (internal or external)? Make sure you have the correct network access in place.
      – Do you need to encrypt the data (internally or externally)? That will depend on the data and your policies.
    • Data Ingestion
      – Can you handle the traffic to the “collectors”? Make sure the solution you choose can scale out. Apache Flume is a good example of this.
      – Do you have redundant / self-healing paths into the cluster? Make sure you’re not point-to-point. In Flume, Storm, and Kafka you can configure forks etc., but you may need to handle duplicate data.
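The slide notes that forked or redundant ingestion paths may deliver the same record more than once. A minimal sketch of handling that, assuming each event carries a unique `id` field (an assumption; the talk does not say how duplicates were identified):

```python
def deduplicate(events, seen=None):
    """Yield each event once, using its 'id' field to detect duplicates.

    `seen` is an in-memory set of ids already processed. This is a sketch:
    a production pipeline would need to bound or expire this state.
    """
    seen = set() if seen is None else seen
    for event in events:
        event_id = event["id"]
        if event_id in seen:
            continue  # already delivered via another path
        seen.add(event_id)
        yield event


# Hypothetical events arriving over two redundant paths.
incoming = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 4.50}, {"id": 1, "amount": 9.99}]
unique = list(deduplicate(incoming))
```

Passing the same `seen` set across calls lets the check span several batches.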
  • 18. Ready to go into Production?
    • Data Repository
      – Can you handle out-of-order data? Make sure you have a way to address it, as it will happen.
      – Can you scale the cluster for data volume spikes and/or processing spikes? Hadoop and Cassandra make it very easy to add nodes. If you cannot add nodes, be prepared to drop data, stop processes, or both.
      – Should I just store plain text (compressed)? If it’s very wide data and you query a subset of the columns, Parquet would be a good choice. If you would like to be able to version your data schema, Avro is a good choice.
    • Data Services
      – Do applications need to access this data? Build a RESTful service to access the data.
      – Do you have data resiliency? What is data resiliency, I hear you ask...
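One common way to address out-of-order data is to file each record by the time the event happened rather than the time it arrived. The sketch below assumes each event has an `event_time` field in epoch seconds and uses hourly buckets; both are illustrative choices, not details from the talk.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_by_event_hour(events):
    """Group events into hourly buckets keyed by event time (UTC), not arrival order."""
    buckets = defaultdict(list)
    for event in events:
        ts = datetime.fromtimestamp(event["event_time"], tz=timezone.utc)
        buckets[ts.strftime("%Y-%m-%d-%H")].append(event)
    return dict(buckets)


# Hypothetical arrivals: the second record is late and belongs to an earlier hour.
arrivals = [
    {"id": "b", "event_time": 3600},
    {"id": "a", "event_time": 0},
    {"id": "c", "event_time": 3700},
]
partitions = partition_by_event_hour(arrivals)
```

A late record then lands in the partition for its own hour, so downstream jobs only need to reprocess that partition.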
  • 19. DATA RESILIENCY
    – Stovepipe: one-to-one relationship from data source to product.
    – Hard Failure: if the data source is broken, so is the app.
    – Multi-sourced: redundancy of overlapping data sources makes your products more resilient.
    – Graceful Degradation: if a data source breaks, there is a backup and your app continues to function.
    Production data services abstract the probabilistic integration of overlapping data sources. We call this model a Data Mesh.
    [Diagram labels: Products; Data Sources; Broken Data Sources; Data Services]
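Graceful degradation across overlapping sources can be sketched as an ordered fallback. This is a simplification, not the probabilistic integration the slide describes, and the source callables and names below are hypothetical.

```python
def fetch_with_fallback(sources, key):
    """Try each data source in order and return the first usable answer.

    Each source is a callable taking a key; a source that raises or
    returns None is skipped so the caller keeps working.
    """
    failures = 0
    for source in sources:
        try:
            value = source(key)
        except Exception:
            failures += 1  # broken source: fall through to the next one
            continue
        if value is not None:
            return value
    raise LookupError(f"no data source could answer {key!r} ({failures} failed)")


# Hypothetical sources: the primary is broken, the backup still works.
def broken_primary(key):
    raise ConnectionError("primary source is down")


backup = {"store-42": 1234.5}.get

result = fetch_with_fallback([broken_primary, backup], "store-42")
```

With a single source this collapses to the stovepipe case: when that source raises, the call fails outright.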
  • 20. Ready to go into Production?
    • Analytics
      – Which is the right SQL on Hadoop solution (for me)? There are a few to choose from: Hive, Impala, Spark SQL, and HAWQ (and growing). They share the same meta store. Some are faster than others (depends on the type of query).
      – Which BI tool should I use? See if the current tool works with your distro. You can also look at Platfora, Datameer, and Karmasphere.
      – Do I still need to set up business views of the data? Yes you do, but the benefit is you still have access to the raw data for the advanced data analyst or data scientist.
      – What about deep analytics? Now that you have a data lake, you can take advantage of doing deep analytics on the data without moving it out.
  • 21. Analytics tools
  • 22. Ready to go into Production?
    • Data Management
      – Security (who can see what?): It’s getting there. At the query level, using Hive or Impala, you can use Apache Sentry or Apache Knox. There are other 3rd party tools like Dataguise that let you do things like encryption at rest, or masking.
      – Can you meet your SLA when other jobs / queries are running? Using “Fair Scheduler” will help you manage your job SLAs. A 3rd party product by Pepper Data can help with this too (and a little more).
      – What monitoring do you have in place?
      – Cluster failover?
  • 23. Adding more use cases
    • Am I duplicating data?
    • Can I reuse the infrastructure I’ve already created?
    • Do I have enough room in the cluster (space/processing)?
    • Will I impact the SLAs of jobs/queries currently running?
  • 24. FORTUNE 500 RETAIL COMPANY: Enabling Real Time Database Monitoring. HIGH LEVEL ARCHITECTURE: Oracle Stats Collection (pulling data over JDBC, sending data to Graphite, writing data to HDFS).
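The "sending data to Graphite" step can be sketched with Graphite's plaintext protocol, where each metric is one line of the form `path value timestamp`. The metric name, host, and helper names below are assumptions for illustration; the slide does not say how the collector was written.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol: 'path value timestamp\\n'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(lines, host="localhost", port=2003):
    """Send formatted lines to a Carbon plaintext listener (2003 is the usual default port)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric collected from a database statistics query.
line = graphite_line("oracle.db1.active_sessions", 42, timestamp=1401321600)
```

Calling `send_to_graphite([line])` would deliver the metric, provided a Carbon listener is reachable at the given host and port.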
  • 25. FORTUNE 500 RETAIL COMPANY: Enabling Log Collection & Search. [Diagram labels: APPLICATION SERVERS; statsd; http; Log Search; http]
  • 26. Questions. Yes, We’re Hiring: svds.com/join-us
  • 27. THANK YOU. Stephen O’Sullivan, @steveos