Hadoop & Data Science For The Enterprise
30 Tips & Tricks + Worksheets

https://www.slideshare.net/markslusar
@MarkSlusar
Allstate Insurance Company

© Allstate Insurance Company Proprietary and Confidential
Allstate: The Good Hands Company
The Allstate Corporation (NYSE: ALL) is the nation's largest publicly held
personal lines insurer.
Allstate provides insurance products to approximately 16 million households.
Allstate was founded in 1931 as part of Sears, Roebuck & Co.
Approximately: 38,600 Employees and 11,200 Agencies
Brands: Allstate, Esurance, Encompass, Answer Financial

Auto insurance, homeowners insurance, life insurance and investment products
including retirement planning, annuities and mutual funds.

October 25, 2013

Mark Slusar
https://www.slideshare.net/markslusar
Part of Allstate Quantitative Research & Analytics
(AKA Data Science)

I really like Data…
Since '98 in the Workplace
Since '88 as a Geek
Early Hadoop Adopter @ Navteq & Nokia
Twitter @MarkSlusar

1 / 30 Hadoop Loves ETL & Data Warehouse Offloading
• Don't hyper-focus only on ETL and DW Offload
• Right now, 80% of data science isn't much science, it's wrestling with data; Hadoop changes that.
• Hadoop rocks at ETL (and is great for storage)
• You'll find yourself doing more T than E&L
• Build your analytics files faster, better, cheaper, and with more flexibility

2 / 30 Play the Right Hadoop Data Science Game
• Descriptive (Easy): “What happened?”
• Predictive (Medium): “What will happen?”
• Prescriptive (Hard): “What should we do about it?”
• Batch, Ad Hoc, Real Time, Others

3 / 30 Learn To Profile Effectively At Scale
• Get comfy with your data
• Use a Query tool (Hive, Impala, many others)
• If applicable, Use Search
• Use workflow systems (Oozie, et al) for periodic data collection and pre-processing from other operational systems.
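
As a rough illustration of "getting comfy with your data", the sketch below profiles one column of a small sampled extract: row count, null count, distinct values, and the most frequent values. The column names (`policy_id`, `state`), the sample rows, and the set of tokens treated as null are all assumptions made for this example; at scale the same questions would normally be asked through a query tool such as Hive or Impala.

```python
import csv
import io
from collections import Counter

# Tokens treated as "null" are an assumption for this sketch; adjust to your data.
NULL_TOKENS = {"", "NULL", "\\N"}

def profile_column(rows, column):
    """Count rows, nulls, and distinct values for one column of dict rows."""
    total = 0
    nulls = 0
    counts = Counter()
    for row in rows:
        total += 1
        value = (row.get(column) or "").strip()
        if value in NULL_TOKENS:
            nulls += 1
        else:
            counts[value] += 1
    return {
        "rows": total,
        "nulls": nulls,
        "distinct": len(counts),
        "top": counts.most_common(3),
    }

# Hypothetical extract with made-up columns, standing in for a sampled file.
sample = io.StringIO("policy_id,state\n1,IL\n2,\n3,IL\n4,WI\n5,NULL\n")
report = profile_column(csv.DictReader(sample), "state")
print(report)
```

Running a profile like this on a sample before writing the full job tends to surface surprises (unexpected null markers, skewed keys) early.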

4 / 30 Brace Yourself For Hadoop 2.0
• Storm
• HOYA (HBase on YARN)
• Spark & associated projects
• Giraph and similar
• And more: everything gets better
• Hurry up, get learning

5 / 30 Skills
• Train (Private, Public, Free, Books)
• Network (internets, msg boards)
• Consultants
• Inside your company: create your own internal user group to share ideas
• Hadoop User groups (CHUG if you're in Chicago :) (Find a HUG near you on meetup.com)

Image credit: Yuko P

6 / 30 Security
• File system, Kerberos
• Sentry, Knox, others
• Encryption (how much?)
• Vendors

• Your security organization will need a Hadoop intro; keep them in the loop

7 / 30 Use Other Platforms As Needed
• Outside of *gasp* Hadoop!!! Hadoop is not the solution for everything.
• With existing platforms, compare & contrast:
• Cost
• Performance
• Maintenance
• Scalability
• Extensibility, Reliability, High Availability, et al

8 / 30 Understand Analytics & Business
• Re-learn BI tools as needed
• Finance & Accounting Foundations
• There are a lot of tools out there; many of them are throwing their hat into the ring
• Great existing connectors to Hadoop
• Think differently from the traditional way. Adopt open source.

9 / 30 Use Sqoop, Use Flume
• Time savers
• Beware of over-usage, start small
• Consider querying 'idle' backup environments (like DR, disaster recovery, if permitted)
• Some DBAs may initially dislike Sqoop
• Use the appropriate connector (e.g. OraOop)
• Understand the nature of the data, relationships, deltas
• Avoid a “Ha-Dump” (loading data in for no reason)
• Use backup servers when possible, don't hammer prod servers

10 / 30 Learn Python
• Write less code, Do more, faster
• http://learnpythonthehardway.org
• Great starting point
• Use Python with
Hadoop Streaming

11 / 30 Learn Python Modules
• NumPy & SciPy (math)
• Scikit-Learn (ML)
• Pandas (data)
• Text Mining (NLTK, NLP et al)
• Python version(s): 2.7.x or 3? YMMV, not everything is working on 3 yet

12 / 30 Learn R
• Use & learn R packages, huge time-savers
• Use CRAN, it's great & free
• Consider a supported distribution (Oracle, Tibco, Revolution, et al)
• Not everything can effectively run in parallel; some things are actually SLOWER on Hadoop

13 / 30 Admin
Treat the environment as a research tool as long as
possible – keep administrative channels open

Check your config files into version control – Check
everything into version control

Hadoop 2.0 performance management

14 / 30 Back it up?
• Yes? No? Sometimes?
• Use HDFS as your system of record?
• Use another cluster made for archival? Appliance?
• Tape is pennies per GB!

15 / 30 Advanced Predictive Modeling
• Understand what algorithms can & cannot be run in
parallel (ever?)
• This can quickly get complex

• Consider single “big boxes”
when needed (no Hadoop)
• GPUs are still relevant
• Bonus Points: GPUs in your Cluster
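
One way to see which computations parallelize cleanly is to compare a mean with a median. The sketch below is an illustration with made-up numbers, not a statement about any particular library: a mean can be rebuilt exactly from per-chunk (sum, count) pairs, while taking the median of per-chunk medians generally does not give the true median.

```python
import statistics

def partial(chunk):
    """Per-chunk summary that a mapper could emit: (sum, count)."""
    return (sum(chunk), len(chunk))

def combine_mean(partials):
    """Combine per-chunk summaries into the exact global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Hypothetical data split across three "nodes".
chunks = [[1, 2, 3], [4, 5], [100]]
all_values = [v for chunk in chunks for v in chunk]

parallel_mean = combine_mean([partial(c) for c in chunks])
median_of_medians = statistics.median(statistics.median(c) for c in chunks)
true_median = statistics.median(all_values)

print("mean from partials:", parallel_mean)
print("mean on one box:   ", statistics.mean(all_values))
print("median of medians: ", median_of_medians)
print("true median:       ", true_median)
```

The two means agree, while the two medians differ, which is the kind of check worth doing before assuming an algorithm will distribute.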

16 / 30 Get Comfy Streaming
• Quick, effective, useful
• You might be able to port old code (anything that can read from stdin & write to stdout)
• Your port may need some tweaking for Map/Reduce
• Stream with Pig & Hive when appropriate
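
A minimal sketch of what a Streaming job can look like in Python: a mapper that emits tab-separated key/count pairs and a reducer that sums counts per key, relying on the reducer receiving its input sorted by key. The "map"/"reduce" command-line argument, the comma-separated input layout, and the sample rows are assumptions for this example; in a real job the script would be passed to Hadoop Streaming through its `-mapper` and `-reducer` options.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'key<TAB>1' for the first comma-separated field of each line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key = line.split(",")[0]
        yield f"{key}\t1"

def reducer(lines):
    """Sum counts per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Hypothetical convention: "map" or "reduce" as the first argument
    # selects the role when run under Streaming (stdin in, stdout out).
    if len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
        step = mapper if sys.argv[1] == "map" else reducer
        for out in step(sys.stdin):
            print(out)
    else:
        # Local dry run: sorted() stands in for the shuffle/sort phase.
        sample = ["IL,auto\n", "WI,home\n", "IL,home\n"]
        for out in reducer(sorted(mapper(sample))):
            print(out)
```

Because both roles only touch stdin and stdout, the same script can be tested on a laptop with a shell pipeline and a `sort` in the middle before it ever reaches the cluster.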

17 / 30 Use Hive & Pig
• Write your own Hive UDFs
• Write your own Pig UDFs
• Consider writing UDAFs (aggregate functions) and UDTFs (table-generating functions)

18 / 30 Learn The Enterprise Packages
• It's not just about open source
• Make sure you get what you pay for
Analogy: Commercial & Proprietary vs. Open Source & Standardized?

19 / 30 Get Ready For YARNtacular Analytics
Examples: 0xdata & Skytree
Others: great things to come!

Image credit: Hortonworks

20 / 30 Know Your Data (Intimately)
• Once you know it, re-learn it
• Peer review your work
• Don't forget to quality check on raw.
• Quality check first, Analysis second
• Understand how Nulls work / don't work
• Get comfortable with Metadata tools (HCatalog for example)
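
To make the point about nulls concrete, here is a small sketch with made-up values showing how the same column gives two different averages depending on whether nulls are skipped (as SQL-style `AVG` does) or silently treated as zero. The function names and the sample data are hypothetical.

```python
def mean_skipping_nulls(values):
    """Average over non-null values only, mirroring SQL-style AVG."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

def mean_nulls_as_zero(values):
    """Average where a null is counted as 0, a common accidental behavior."""
    if not values:
        return None
    return sum(0 if v is None else v for v in values) / len(values)

# Hypothetical column with two missing entries.
amounts = [10.0, None, 30.0, None]

print("skipping nulls:", mean_skipping_nulls(amounts))
print("nulls as zero: ", mean_nulls_as_zero(amounts))
```

Neither answer is wrong in itself; the risk is not knowing which one a given tool or script is producing.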

21 / 30 Complement Your Data
• Find More
• Co-mingle new “big” sources
• JOINs can be hard: Blending is an Art and a Science
• Use specialized joins when joining small data sets. Example: Map-Side joins
• Seek Corroboration among sources
• Build new between structured & unstructured
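
The idea behind a map-side join can be sketched in a few lines: the small data set is loaded into memory once per mapper, and each record of the large data set is enriched as it streams past, so no shuffle is needed for the join itself. The tab-separated layout, the `UNKNOWN` default, and the sample records below are assumptions for illustration only.

```python
def load_lookup(lines):
    """Load the small side of the join into an in-memory dict."""
    lookup = {}
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        lookup[key] = value
    return lookup

def map_side_join(big_lines, lookup, default="UNKNOWN"):
    """Append the looked-up value to each record of the large side."""
    for line in big_lines:
        key, rest = line.rstrip("\n").split("\t", 1)
        yield "\t".join([key, rest, lookup.get(key, default)])

# Hypothetical small dimension table and large fact records.
small = ["IL\tIllinois\n", "WI\tWisconsin\n"]
big = ["IL\t100\n", "WI\t250\n", "MN\t75\n"]

joined = list(map_side_join(big, load_lookup(small)))
for row in joined:
    print(row)
```

This only works while the small side fits comfortably in each mapper's memory; past that point a regular reduce-side join is the safer choice.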

22 / 30 Get The Math & Stats Expertise
• Learn it; Hire it; Train it
• Understand it, Use it, Profit
Diagram labels: Common Sense & Hadoop; Math & Stats; Domain Expertise; Coding; Inquisitiveness

23 / 30 Get Down With The Graph
• Learn about linked data
• Use Hadoop to build graphs, query and analyze
graphs
• Batch vs. Ad Hoc

24 / 30 Go Jump In A Lake
A data lake, that is.

• Don't call it a mainframe, warehouse, data mart, etc.
• Consider use cases & security vs. traditional approaches

25 / 30 Mahout is “in”
• Use it first, but there's much more beyond it
• Outside of Mahout, try building the models yourself (Streaming, R, or Java)

26 / 30 Don't Be Afraid to Flatten Data
• Going from RDBMS to Hadoop:
• Don't dread De-normalization
• For good? Probably Not…
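
A small sketch of what flattening can mean in practice: two normalized tables are merged into one wide record per row, so downstream jobs can scan a single file without joining. The table names, columns, and values are invented for the example.

```python
# Hypothetical normalized tables, as they might come out of an RDBMS.
customers = {
    101: {"name": "A. Driver", "state": "IL"},
    102: {"name": "B. Owner", "state": "WI"},
}
policies = [
    {"policy_id": 1, "customer_id": 101, "product": "auto"},
    {"policy_id": 2, "customer_id": 102, "product": "home"},
    {"policy_id": 3, "customer_id": 999, "product": "auto"},  # no matching customer
]

def flatten(policies, customers):
    """Copy customer attributes onto each policy row (one wide record per policy)."""
    flat = []
    for policy in policies:
        customer = customers.get(policy["customer_id"], {})
        row = dict(policy)
        row["customer_name"] = customer.get("name")
        row["customer_state"] = customer.get("state")
        flat.append(row)
    return flat

flat_rows = flatten(policies, customers)
for row in flat_rows:
    print(row)
```

The trade-off is duplicated customer data in exchange for simpler, join-free scans; unmatched keys surface as null fields that still need a quality check.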

27 / 30 Use “Hadoop beat ABC by 400x” Sparingly

Everyone will get the point:
“A big cluster can totally
whomp on your other systems”

Be nice.

28 / 30 Ask Questions Of Data
Ask old questions: previously unanswerable
• Depth? Breadth?
• Scale? Detail?
Ask new questions: previously unthinkable

29 / 30 Data Science Is Science

Response Time is the most important part of any data science platform's SLA

Think of Pasteur's Quadrant:
* Seek Understanding of Data
* Seek Practical Use of Data
Your Lab
* The Lab is not the Factory
* The Factory is not the Lab

Applied and Basic research
Quest for fundamental understanding? | Considerations of use? No | Considerations of use? Yes
Yes | Pure basic research (Bohr) | Use-inspired basic research (Pasteur)
No | – | Pure applied research (Edison)

30 / 30 Don't Forget Visualization
• Tools (commercial & open source): too many to mention!
• Query tools + Query Engines = Awesome

31 / 30….. Have Fun!

https://www.slideshare.net/markslusar
For High Level Use Case Worksheets
Huge Thanks to the Organizers! O’Reilly & Cloudera

Contact me @MarkSlusar
Allstate is always interested in Data Scientists & Engineers!
Contact me or visit: http://careers.allstate.com/

Worksheet #1 Hadoop Use Cases
Determine Use Cases, Example Below:
• ETL
  • Extremely Responsive & Nimble Collection of tools & APIs: Hive, Pig, Streaming API (Python, et al)
• Descriptive Analytics (aka BI)
  • Using built-in tools (Hive, Pig, Streaming API)
  • Using COTS tools (Commercial & Open) with streaming API & query engines (Impala, Hive, et al)
• Predictive Analytics
  • Using tools like R (streaming) and Python (numpy, scipy, scikit, & anaconda over streaming)
• Storage & Archival
  • Very low cost, highly fault-tolerant, very responsive
• {{ And more, YMMV }}

Worksheet #2 Data Science Ops
Determine Ops Usage, Example Below:
• Ad-Hoc Operations: One-off transactions
• Sustainment Operations: A repeatable & trusted process
• Research Operations: Trying new queries, software, approaches, methods
• Development Operations: Creating a Defined Operational Process for Sustainment
• Test Operations: Validating Data Quality, Consistency, Speed, Coverage, et al
• Governance Operations: Validating Security Permissions, Lineage, Usage, Importance, De-Duplication.
• {{ And more, YMMV }}

Worksheet #3
Crossing “Hadoop Use Cases” with the “Ops Usage”

Your Outcome may vary…
Ops Usage | Storage & Archival | ETL | Descriptive Analytics | Predictive Analytics
Ad Hoc Ops | N/A | Analysts | Data Science | Data Science
Sustainment Ops | Data Management | Data Management | Analysts And Data Management | Data Science
Research Ops | Data Science | Data Science | Data Science | Data Science
Development Ops | N/A | Data Management | Data Science | Data Science
Test Ops | Data Stewardship | Data Stewardship | Data Science | Data Science
Governance Ops | Data Stewardship | Data Stewardship | Data Stewardship | Data Stewardship

Worksheet #4
Crossing “Hadoop Use Cases”
with your Organization
Your Outcome may vary…
Storage &
Archival
Research

ETL
Offload

Descriptive
Analytics

Predictive
Analytics

X

X

X

X

X

X

X

X

X

X

X

Marketing

Sales &
Pricing
IT Ops

X

X

Delivery

X

X

Other
Other

Other

Hadoop and Data Science for the Enterprise (Strata & Hadoop World Conference Oct 29 2013)

Essentials of Automations: The Art of Triggers and Actions in FME
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 

Hadoop and Data Science for the Enterprise (Strata & Hadoop World Conference Oct 29 2013)

  • 1. Hadoop & Data Science For The Enterprise 30 Tips & Tricks + Worksheets https://www.slideshare.net/markslusar @MarkSlusar Allstate Insurance Company © Allstate Insurance Company Proprietary and Confidential
  • 2. Allstate: The Good Hands Company The Allstate Corporation (NYSE: ALL) is the nation's largest publicly held personal lines insurer. Allstate provides insurance products to approximately 16 million households. Allstate was founded in 1931 as part of Sears, Roebuck & Co. Approximately 38,600 employees and 11,200 agencies. Brands: Allstate, Esurance, Encompass, Answer Financial. Auto insurance, homeowners insurance, life insurance and investment products including retirement planning, annuities and mutual funds. October 25, 2013 Proprietary and Confidential
  • 3. Mark Slusar https://www.slideshare.net/markslusar Part of Allstate Quantitative Research & Analytics (AKA Data Science) I really like Data… Since '98 in the Workplace, Since '88 as a Geek. Early Hadoop Adopter @ Navteq & Nokia. Twitter @MarkSlusar
  • 4. 1 / 30 Hadoop Loves ETL & Data Warehouse Offloading • Don't hyper-focus only on ETL and DW Offload • Right now, 80% of data science isn't much science, it's wrestling with data; Hadoop changes that. • Hadoop rocks at ETL (and is great for storage) • You'll find yourself doing more T than E&L • Build your analytics files faster, better, cheaper, and with more flexibility
  • 5. 2 / 30 Play the Right Hadoop Data Science Game • Descriptive (Easy) • “What happened?” • Predictive (Medium) • “What will happen?” • Prescriptive (Hard) • “What should we do about it?” • Batch, Ad Hoc, Real Time, Others
  • 6. 3 / 30 Learn To Profile Effectively At Scale • Get comfy with your data • Use a Query tool (Hive, Impala, many others) • If applicable, Use Search • Use workflow systems (Oozie, et al) for periodic data collection and pre-processing from other operational systems.
  • 7. 4 / 30 Brace Yourself For Hadoop 2.0 • Storm • HOYA (HBase on YARN) • Spark & associated projects • Giraph and similar • And more... • Everything gets better. Hurry up, get learning.
  • 8. 5 / 30 Skills • Train (Private, Public, Free, Books) • Network (internets, msg boards) • Consultants • Inside your company: create your own internal user group to share ideas • Hadoop User groups (CHUG if you're in Chicago :) (Find a HUG near you on meetup.com) Image Credit: Yuko P
  • 9. 6 / 30 Security • File system, Kerberos • Sentry, Knox, others • Encryption (how much?) • Vendors • Your security organization will need a Hadoop intro; keep them in the loop
  • 10. 7 / 30 Use Other Platforms As Needed • Outside of *gasp* Hadoop!!! Hadoop is not the solution for everything. • With existing platforms, compare & contrast: • Cost • Performance • Maintenance • Scalability • Extensibility, Reliability, High Availability, et al
  • 11. 8 / 30 Understand Analytics & Business • Re-learn BI tools as needed • Finance & Accounting Foundations • There are a lot of tools out there; many of them are throwing their hats into the ring • Great existing connectors to Hadoop • Think differently from the traditional way. Adopt open source.
  • 12. 9 / 30 Use Sqoop, Use Flume • Time savers • Beware of over-usage, start small • Consider querying 'idle' backup environments (like DR, disaster recovery, if permitted) • Some DBAs may initially dislike Sqoop • Use the appropriate connector (e.g. OraOop) • Understand the nature of the data, relationships, deltas • Avoid a “Ha-Dump” (loading data in for no reason) • Use backup servers when possible, don't hammer prod servers
  • 13. 10 / 30 Learn Python • Write less code, Do more, faster • http://learnpythonthehardway.org • Great starting point • Use Python with Hadoop Streaming
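As a rough illustration of the "Use Python with Hadoop Streaming" tip, here is a minimal word-count sketch. It is not from the slides: the function names and the `map`/`reduce` command-line argument are assumptions made for the example. It relies only on the Streaming contract that a stage reads records from stdin and writes tab-separated key/value records to stdout, with reducer input arriving sorted by key.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; input must already be sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: run as "script.py map" or "script.py reduce".
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    stage = mapper if sys.argv[1] == "map" else reducer
    for record in stage(sys.stdin):
        print(record)
```

Locally, the two stages can be chained with a sort in between to mimic the shuffle. On a cluster, the same script would typically be passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.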
  • 14. 11 / 30 Learn Python Modules • NumPy & SciPy (math) • Scikit-Learn (ML) • Pandas (data) • Text Mining (NLTK, NLP et al) • Python version(s): 2.7.x or 3? YMMV, not everything is working on 3 yet
  • 15. 12 / 30 Learn R • Use & Learn R packages, huge time-savers • Use CRAN, it's great & free • Consider a supported distribution: (Oracle, Tibco, Revolution, et al) • Not everything can effectively run in parallel, some things are actually SLOWER on Hadoop
  • 16. 13 / 30 Admin • Treat the environment as a research tool as long as possible; keep administrative channels open • Check your config files into version control; check everything into version control • Hadoop 2.0 performance management
  • 17. 14 / 30 Back it up? • Yes? No? Sometimes? • Use HDFS as your system of record? • Use another cluster made for archival? Appliance? • Tape is pennies per GB!
  • 18. 15 / 30 Advanced Predictive Modeling • Understand what algorithms can & cannot be run in parallel (ever?) • This can quickly get complex • Consider single “big boxes” when needed (no Hadoop) • GPUs are still relevant • Bonus Points: GPUs in your Cluster
  • 19. 16 / 30 Get Comfy Streaming • Quick, effective, useful • You might be able to port old code (anything that can read from stdin & write to stdout) • Your port may need some tweaking for Map/Reduce • Stream with Pig & Hive when appropriate
  • 20. 17 / 30 Use Hive & Pig • Write your own Hive UDFs • Write your own Pig UDFs • Consider writing UDAFs (aggregators) and UDTFs (transforms)
  • 21. 18 / 30 Learn The Enterprise Packages • It's not just about open source • Make sure you get what you pay for Analogy: Commercial & Proprietary Open Source & Standardized?
  • 22. 19 / 30 Get Ready For YARNtacular Analytics Examples: 0xdata & Skytree. Others: great things to come! Image credit: Hortonworks
  • 23. 20 / 30 Know Your Data (Intimately) • Once you know it, re-learn it • Peer review your work • Don't forget to quality check on raw. • Quality check first, Analysis second • Understand how Nulls work / don't work • Get comfortable with Metadata tools (HCatalog for example)
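To make "quality check on raw" and "understand how Nulls work" concrete, here is a small sketch that counts missing values per column in a delimited text extract. It is only illustrative: the column names and the set of tokens treated as null are assumptions, although `\N` is included because it is Hive's default null marker in text files. Fields missing from a short row are counted as null too.

```python
from collections import Counter

# Assumption: these tokens mean "missing" in the raw extract.
NULL_TOKENS = {"", "NULL", "null", "\\N", "NA"}


def profile_nulls(rows, columns, delimiter="\t"):
    """Return (row count, {column: missing-value count}) for delimited text."""
    missing = Counter()
    total = 0
    for line in rows:
        fields = line.rstrip("\n").split(delimiter)
        total += 1
        for i, name in enumerate(columns):
            # A row that is too short is treated as having empty trailing fields.
            value = fields[i] if i < len(fields) else ""
            if value.strip() in NULL_TOKENS:
                missing[name] += 1
    return total, {name: missing[name] for name in columns}
```

Running a profile like this on the raw files before any analysis shows how many rows each null convention actually affects, which is easy to miss once the data has been loaded and cast.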
  • 24. 21 / 30 Complement Your Data • Find More • Co-mingle new “big” sources • JOINs can be hard: Blending is an Art and a Science • Use specialized joins when joining small data sets. Example: Map-Side joins • Seek Corroboration among sources • Build new between structured & unstructured
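The map-side join mentioned above can be sketched in plain Python. The idea is that the small data set is loaded into memory on every mapper, so the large data set is joined as it streams past and no shuffle or reduce step is needed. The function names and the tab-separated layout are assumptions made for this illustration, not something the slides specify.

```python
def load_small_table(lines, delimiter="\t"):
    """Build an in-memory lookup from the small data set (key -> rest of row)."""
    lookup = {}
    for line in lines:
        key, value = line.rstrip("\n").split(delimiter, 1)
        lookup[key] = value
    return lookup


def map_side_join(big_lines, lookup, delimiter="\t"):
    """Inner-join each streamed record against the lookup, mapper-side only."""
    for line in big_lines:
        key, rest = line.rstrip("\n").split(delimiter, 1)
        if key in lookup:  # records without a match are dropped (inner join)
            yield delimiter.join([key, rest, lookup[key]])
```

This only works while the small side fits comfortably in each mapper's memory. Beyond that, a regular reduce-side join is the safer choice.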
  • 25. 22 / 30 Get The Math & Stats Expertise • Learn it; Hire it; Train it • Understand it, Use it, Profit [Diagram labels: Common Sense & Hadoop; Math & Stats; Domain Expertise; Coding; Inquisitiveness]
  • 26. 23 / 30 Get Down With The Graph • Learn about linked data • Use Hadoop to build graphs, query and analyze graphs • Batch vs. Ad Hoc
  • 27. 24 / 30 Go Jump In A Lake: a data lake, that is... • Don't call it a mainframe, warehouse, data mart, etc. • Consider use cases & security vs. traditional approaches
  • 28. 25 / 30 Mahout is “in” • Use it first, but there's much more beyond it • Outside of Mahout, try building the models yourself (Streaming, R, or Java)
  • 29. 26 / 30 Don't Be Afraid to Flatten Data • Going from RDBMS to Hadoop: • Don't dread de-normalization • For good? Probably not…
  • 30. 27 / 30 Use “Hadoop beat ABC by 400x” Sparingly Everyone will get the point: “A big cluster can totally whomp on your other systems” Be nice.
  • 31. 28 / 30 Ask Questions Of Data Ask old questions previously unanswerable • Depth? Breadth? • Scale? Detail? Ask new questions: previously unthinkable
  • 32. 29 / 30 Data Science Is Science Response Time is the most important part of any data science platform's SLA. Think of Pasteur's Quadrant: * Seek Understanding of Data * Seek Practical Use of Data. Your Lab: * The Lab is not the Factory * The Factory is not the Lab. [Figure: Pasteur's Quadrant, applied and basic research. Axes: Quest for fundamental understanding? (Yes/No) and Considerations of use? (No/Yes). Quadrants: Pure basic research (Bohr); Use-inspired basic research (Pasteur); Pure applied research (Edison).]
  • 33. 30 / 30 Don't Forget Visualization • Tools (commercial & open source): too many to mention! • Query tools + Query Engines = Awesome
  • 34. 31 / 30….. Have Fun! https://www.slideshare.net/markslusar For High Level Use Case Worksheets. Huge Thanks to the Organizers! O’Reilly & Cloudera. Contact me @MarkSlusar. Allstate is always interested in Data Scientists & Engineers! Contact me or visit: http://careers.allstate.com/
  • 35. Worksheet #1 Hadoop Use Cases Determine Use Cases, Example Below: • ETL • Extremely Responsive & Nimble Collection of tools & APIs: Hive, Pig, Streaming API (Python, et al) • Descriptive Analytics (aka BI) • Using built-in tools (Hive, Pig, Streaming API) • Using COTS tools (Commercial & Open) with streaming API & query engines (Impala, Hive, et al) • Predictive Analytics • Using tools like R (streaming) and Python (numpy, scipy, scikit, & anaconda over streaming) • Storage & Archival • Very low cost, highly fault-tolerant, very responsive • {{ And more, YMMV }}
  • 36. Worksheet #2 Data Science Ops Determine Ops Usage, Example Below: • Ad-Hoc Operations: One-off transactions • Sustainment Operations: A repeatable & trusted process • Research Operations: Trying new queries, software, approaches, methods • Development Operations: Creating a Defined Operational Process for Sustainment • Test Operations: Validating Data Quality, Consistency, Speed, Coverage, et al • Governance Operations: Validating Security Permissions, Lineage, Usage, Importance, De-Duplication. • {{ And more, YMMV }}
  • 37. Worksheet #3 Crossing “Hadoop Use Cases” with the “Ops Usage” (your outcome may vary):
      Ops usage        | Storage & Archival | ETL              | Descriptive Analytics          | Predictive Analytics
      Ad Hoc Ops       | N/A                | Analysts         | Data Science                   | Data Science
      Sustainment Ops  | Data Management    | Data Management  | Analysts and Data Management   | Data Science
      Research Ops     | Data Science       | Data Science     | Data Science                   | Data Science
      Development Ops  | N/A                | Data Management  | Data Science                   | Data Science
      Test Ops         | Data Stewardship   | Data Stewardship | Data Science                   | Data Science
      Governance Ops   | Data Stewardship   | Data Stewardship | Data Stewardship               | Data Stewardship
  • 38. Worksheet #4 Crossing “Hadoop Use Cases” with your Organization Your Outcome may vary… Storage & Archival Research ETL Offload Descriptive Analytics Predictive Analytics X X X X X X X X X X X Marketing Sales & Pricing IT Ops X X Delivery X X Other Other Other