© Hortonworks Inc. 2013
Hortonworks
Data Science with Hadoop – A Primer
Hadoop Summit, June 2013
Ofer Mendelevitch
ofer@hortonworks.com
@ofermend
Who am I?
currently <- c(
role="director of data sciences",
company="Hortonworks")
• Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc…
• Blog: www.achessdad.com
What will I be talking about?
• What is Data Science?
• Hadoop and Data Science
• Use-cases: data science with Hadoop
• How to get started?
What is Data Science?
• What is Data Science? The art of building data products
• Data Product: software product whose core functionality relies on applying statistical (or machine learning) methods to data.
• What is a data scientist? A person who does this
Data science & big data
With Hadoop…
Time and cost of building large scale
data products is dramatically reduced
An Apache Hadoop Platform: Hortonworks Data Platform (HDP)
• Hadoop Core: distributed storage & processing (HDFS, MapReduce)
• Platform Services: enterprise readiness (HA, DR, snapshots, security, …)
• Data Services: store, process and access data (HCatalog, Hive, Pig, HBase, Sqoop, Flume)
• Operational Services: manage & operate at scale (Oozie, Ambari)
• Deployment options: OS / VM, cloud, appliance
A typical Big Data Architecture
• Data sources: traditional sources (RDBMS, OLTP, OLAP); new sources (web logs, email, sensor data, social media); mobile data; OLTP, POS systems
• Data systems: traditional repos (RDBMS, EDW, MPP) alongside the Hortonworks Data Platform
• Applications: business analytics, custom applications, packaged applications
• Operational tools: manage & monitor
• Dev & data tools: build & test
Keys to Hadoop’s power
• Computation co-located with data
– Data and computation system co-designed to work
together
• Affordable at scale
– Use “commodity” hardware nodes
– Self-healing; failure handled by software
– Very good at batch processing of large datasets
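As a rough illustration of the batch-processing model behind these points, the sketch below runs the map, shuffle and reduce phases of a word count inside a single Python process. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are assumptions made for this example. On a real cluster the framework would run the map step on the nodes that already hold each block of data.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) pairs for one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over all input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input; in Hadoop each line would come from a file block on HDFS.
sample_lines = ["Hadoop stores data", "Hadoop processes data"]
counts = word_count(sample_lines)
print(counts)
```

Because each map call looks at one line only, the work can be split across as many nodes as there are data blocks, which is what makes the model a good fit for large batch jobs.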
Hadoop improves productivity of data scientists
• All data in one place
– Ability to store all the data in raw format
– Data silo convergence
– Data scientists will find innovative uses of combined data assets
• Data/compute capabilities available as shared asset
– Data scientists can quickly prototype a new idea without an up-front request for funding
Data-driven innovation is accelerated since Hadoop is “schema on read”
Without Hadoop (“schema change” project):
• Start: “I need new data”
• 6 months: “Finally, we start collecting”
• 9 months: “Let me see… is it any good?”
With Hadoop (3 months):
• “Let’s just put it in a folder on HDFS”
• “Let me see… is it any good?”
• “My model is awesome!”
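A minimal sketch of what “schema on read” means in practice: raw records are stored untouched, and column names are applied only when the data is read. The pipe-delimited log format, the field names and the read_with_schema helper are all assumptions for this example, and an in-memory buffer stands in for a file in an HDFS folder.

```python
import io

# Hypothetical raw file contents as they might sit in a folder on HDFS.
RAW = io.StringIO(
    "u101|2013-06-01|/home|chrome\n"
    "u102|2013-06-01|/cart\n"  # older record written before the 4th field existed
)

def read_with_schema(lines, fields):
    """Apply a schema at read time: name the columns, pad missing ones with None."""
    for line in lines:
        parts = line.rstrip("\n").split("|")
        parts += [None] * (len(fields) - len(parts))
        yield dict(zip(fields, parts))

# The schema lives in the reader, not in the storage layer.
schema_v2 = ["user_id", "date", "url", "browser"]
records = list(read_with_schema(RAW, schema_v2))
print(records)
```

Adding a field later only changes the reader; the older records stay where they are and are simply read with a missing value.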
Hadoop is ideal for pre-processing of large raw datasets
Raw Data → Processed Data, through steps such as:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Sampling, filtering
• Joins
• Term normalization
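Three of these steps (stripping HTML, term normalization and document vector generation) can be sketched for a single document with the Python standard library. The helper names and the sample document are assumptions for this example, and the normalizer is deliberately crude; on Hadoop the same function would be applied to every raw document in parallel.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(html):
    """Return the visible text of an HTML string, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def normalize_terms(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def document_vector(html):
    """Bag-of-words term counts for one raw document."""
    return Counter(normalize_terms(strip_html(html)))

# Hypothetical raw document.
raw_doc = "<html><body><h1>Pump failure</h1><p>The pump FAILED twice.</p></body></html>"
vector = document_vector(raw_doc)
print(vector)
```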
In machine learning, very often:
more data -> better outcomes
Banko & Brill, 2001
• More examples to learn from
• More possible feature types
– We’re looking for the most useful for our task
Use-cases
A (partial) map of data science “tasks”
Discovery
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns
Prediction
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference
Big Data Science: high energy physics, genomics, etc.
Use-case: product recommendation
• Inputs:
– Explicit product ratings (when provided)
– Implicit information: purchase transactions, page views, comments
Ratings matrix (rows: users U101, U102, U103, U104, U105, …; columns: movies; “?” = not yet rated):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5
Data sources: ratings, page views, forum comments
Goal: predict a preference
Observed ratings (rows: users U101, U102, U103, U104, U105, …; “?” = not yet rated):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5
Completed matrix with predicted ratings filled in:
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       1     3
4     1      5       2     3
1     2      4       1     3
3     2      3       1     5
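One simple way to fill in the missing ratings is user-based collaborative filtering: predict a user's rating for a movie as the similarity-weighted average of what other users gave it. The sketch below is an assumption-laden illustration, not the method used in the talk: the assignment of matrix rows to U101 through U104, the cosine similarity over co-rated movies and the fallback to the user's own mean are all choices made for this example, so its output will not reproduce the completed matrix shown on the slide.

```python
from math import sqrt

ITEMS = ["Epic", "X-Men", "Hobbit", "Argo", "Pirates"]
# None marks a missing rating ("?" on the slide). Row-to-user mapping is assumed.
RATINGS = {
    "U101": [5, 2, 4, None, None],
    "U102": [None, None, 5, 2, None],
    "U103": [1, 2, None, None, 3],
    "U104": [None, 2, 3, 1, 5],
}

def cosine(a, b):
    """Cosine similarity over the items both users have rated (0.0 if none)."""
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not common:
        return 0.0
    dot = sum(x * y for x, y in common)
    norm_a = sqrt(sum(x * x for x, _ in common))
    norm_b = sqrt(sum(y * y for _, y in common))
    return dot / (norm_a * norm_b)

def predict(user, item_index):
    """Similarity-weighted average of other users' ratings for one item."""
    weighted, total = 0.0, 0.0
    for other, row in RATINGS.items():
        if other == user or row[item_index] is None:
            continue
        sim = cosine(RATINGS[user], row)
        weighted += sim * row[item_index]
        total += sim
    if total == 0.0:
        # No usable neighbour: fall back to the user's own mean rating.
        known = [r for r in RATINGS[user] if r is not None]
        return sum(known) / len(known)
    return weighted / total

# Keep observed ratings, fill each gap with a rounded prediction.
completed = {
    user: [r if r is not None else round(predict(user, i), 1) for i, r in enumerate(row)]
    for user, row in RATINGS.items()
}
for user, row in completed.items():
    print(user, dict(zip(ITEMS, row)))
```

With only a handful of ratings the similarities are unreliable (two users who share a single rated movie always get similarity 1.0), which is one reason large preference datasets matter for this task.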
Using Hadoop for recommendation
[Diagram: data sources (transactions, page views, content) → HDFS / MapReduce (pre-process, recommend) → online serving (SQL, custom logic)]
With Hadoop, we can process very large preference datasets
Use-case: failure prediction
• Inputs:
– Equipment history: install date, model, past issues
– Equipment sensor data
– Product catalog: product families, expected lifetime
SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…
Data sources: history, sensor data, product catalog
Building a prediction model
Labeled data (used to build the model):
SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…
Unseen data (the model predicts TTF):
SKU     Install date  Service Person ID  Zip code  Avg temp
332456  3/3/2013      1345               94005     71
442343  6/6/2013      1112               77485     67
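The train-then-predict flow can be sketched with a one-feature least-squares line fitted to the three labeled rows shown above. This is only an illustration under strong assumptions: using average temperature as the sole feature and a straight line as the model are choices made for this example, and three rows are far too few to support any real conclusion about time to failure.

```python
# Labeled rows from the slide: (avg_temp, ttf_days).
labeled = [(72, 180), (68, 450), (82, 332)]
# Unseen rows from the slide: SKU -> avg_temp.
unseen = {"332456": 71, "442343": 67}

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# "Model" step: learn from labeled data, then score the unseen rows.
slope, intercept = fit_line(labeled)
predicted_ttf = {sku: slope * temp + intercept for sku, temp in unseen.items()}
print(predicted_ttf)
```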
Using Hadoop for failure prediction
• HDFS: central repository for all data
– Service records (Word, PDF, etc.)
– Equipment purchase transaction data
– Product catalog: SKUs, model numbers, etc.
• Pre-process
– Convert service records to item features: remove PDF formatting, detect entities in records
– Normalize data using service records, product catalog
– Create feature matrix, ready for the modeling algorithm
Use-case: SaaS application security
• Inputs:
– Click-stream: user interaction with application
User ID  User since  Logins/month  Avg DL KB/day  …
123456   1/3/2004    6             30
998323   5/3/2009    1             5
345375   8/2/2005    22            120
…
Data sources: user data, clicks
Detecting anomalous behavior records
• User access profile modeled as vector of features
• Detect anomalies in application access patterns
– Rules based
– Machine learning based (determine “outlier factor”: 0…1)
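As a minimal sketch of an “outlier factor” between 0 and 1, the code below scores each user's feature vector by its distance from the average profile in z-score space and squashes that distance into the unit interval. The scoring formula is an assumption chosen for this example, not the method from the talk; the three vectors are the logins/month and average download values from the user table on the previous slide.

```python
from math import sqrt
from statistics import mean, pstdev

# Feature vectors from the user table: (logins per month, avg download KB/day).
profiles = {"123456": (6, 30), "998323": (1, 5), "345375": (22, 120)}

def outlier_factors(vectors):
    """Map each vector to a score in [0, 1): distance from the mean in z-score space."""
    columns = list(zip(*vectors.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread
    scores = {}
    for key, vec in vectors.items():
        dist = sqrt(sum(((v - m) / s) ** 2 for v, m, s in zip(vec, mus, sigmas)))
        scores[key] = dist / (1.0 + dist)  # squash the distance into [0, 1)
    return scores

factors = outlier_factors(profiles)
most_anomalous = max(factors, key=factors.get)
print(factors, most_anomalous)
```

A rules-based check would instead compare individual features against fixed thresholds; the two approaches can be combined.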
Using Hadoop for anomaly detection
• HDFS: central repository for all raw data
– Raw user-access logs
– User information (organization, demographics)
• Pre-process
– Build access-profile (behavioral) for each user
• Detect anomalies
– In Hadoop
– Using existing tools: R, SAS, rules engine, etc
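The pre-process step, building one access profile per user from raw logs, amounts to a group-by-user aggregation. The sketch below shows the idea on a few in-memory events; the event layout, the feature names and the build_profiles helper are assumptions for this example. On a cluster the same aggregation would typically be expressed as a Pig, Hive or MapReduce job over the raw logs in HDFS.

```python
from collections import defaultdict

# Hypothetical click-stream events: (user_id, day, action, kilobytes_downloaded).
events = [
    ("u1", "2013-06-01", "login", 0),
    ("u1", "2013-06-01", "download", 40),
    ("u1", "2013-06-02", "login", 0),
    ("u2", "2013-06-01", "login", 0),
    ("u2", "2013-06-01", "download", 500),
]

def build_profiles(rows):
    """Aggregate raw events into one behavioural feature vector per user."""
    logins = defaultdict(int)
    kilobytes = defaultdict(int)
    active_days = defaultdict(set)
    for user, day, action, kb in rows:
        active_days[user].add(day)
        if action == "login":
            logins[user] += 1
        kilobytes[user] += kb
    return {
        user: {
            "logins": logins[user],
            "avg_dl_kb_per_day": kilobytes[user] / len(days),
        }
        for user, days in active_days.items()
    }

access_profiles = build_profiles(events)
print(access_profiles)
```

The resulting per-user vectors are small enough to hand to an existing tool such as R, SAS or a rules engine for the anomaly-detection step.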
How do I get started?
Getting started with Data Science on Hadoop
1. Pick a good use-case that delivers immediate business value
2. Implement a proof-of-value (POV)
3. Build a team (hire/train)
Implement a proof-of-value
• Put together a Hadoop cluster
• Define the POV business use-case
• Pull raw data you need into the cluster
• Build it
• Show the business value of your data assets
Contact us. We can help!
Build a team: the data scientist skillset continuum
Software Engineer → Data Engineer → Data Scientist → Applied Scientist → Research Scientist
Role: Data Engineer
• Function: builds production-grade data products
• Good at:
– Data and systems architecture
– Hadoop, Pig/Hive, MapReduce, Mahout
– Java, Python, Perl, SQL, C++, etc.
– NoSQL (HBase, Cassandra, Mongo)
Role: Applied Scientist
• Function: finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm
• Good at:
– Statistics, machine learning
– Text processing, NLP
– R, Matlab, SAS, SQL
– Scripting, prototyping
– Visualization / telling the story
Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
ofer@hortonworks.com
@ofermend
We’re hiring!
Data Science training: www.hortonworks.com/training

Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 

Data Science with Hadoop: A Primer

  • 1. © Hortonworks Inc. 2013 Hortonworks Data Science with Hadoop – A Primer Hadoop Summit, June 2013 Ofer Mendelevitch ofer@hortonworks.com @ofermend
  • 2. Who am I? currently <- c(role="director of data sciences", company="Hortonworks") • Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc. • Blog: www.achessdad.com
  • 3. What will I be talking about? • What is Data Science? • Hadoop and Data Science • Use-cases: data science with Hadoop • How to get started?
  • 4. What is Data Science? The art of building data products. Data product: a software product whose core functionality relies on applying statistical (or machine learning) methods to data. What is a data scientist? A person who does this.
  • 5. Data science & big data
  • 6. With Hadoop… the time and cost of building large-scale data products is dramatically reduced
  • 7. An Apache Hadoop Platform: Hortonworks Data Platform (HDP) • Hadoop Core (distributed storage & processing): HDFS, MapReduce • Platform Services (enterprise readiness): HA, DR, snapshots, security, … • Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume • Operational Services (manage & operate at scale): Oozie, Ambari • Runs on: OS / VM, cloud, appliance
  • 8. A typical Big Data Architecture • Applications: business analytics, custom applications, packaged applications • Data systems: traditional repos (RDBMS, EDW, MPP), Hortonworks Data Platform • Data sources: traditional sources (RDBMS, OLTP, OLAP; OLTP and POS systems), new sources (web logs, email, sensor data, social media, mobile data) • Operational tools: manage & monitor • Dev & data tools: build & test
  • 9. Keys to Hadoop’s power • Computation co-located with data – Data and computation system co-designed to work together • Affordable at scale – Use “commodity” hardware nodes – Self-healing; failure handled by software – Very good at batch processing of large datasets
  • 10. Hadoop improves productivity of data scientists • All data in one place – Ability to store all the data in raw format – Data silo convergence – Data scientists will find innovative uses of combined data assets • Data/compute capabilities available as shared asset – Data scientists can quickly prototype a new idea without an up-front request for funding
  • 11. Data-driven innovation is accelerated since Hadoop is “schema on read” • Traditional timeline: “I need new data” (start); “schema change” project; “Finally, we start collecting” (6 months); “Let me see… is it any good?” (9 months) • With Hadoop: “Let’s just put it in a folder on HDFS”; “Let me see… is it any good?”; “My model is awesome!” (3 months)
  • 12. Hadoop is ideal for pre-processing of large raw datasets • From raw data to processed data: strip away HTML/PDF/DOC/PPT; entity resolution; term normalization; document vector generation; sampling, filtering; joins
  • 13. In machine learning, very often: more data -> better outcomes (Banko & Brill, 2001) • More examples to learn from • More possible feature types – We’re looking for the ones most useful for our task
  • 14. Use-cases
  • 15. A (partial) map of data science “tasks” • Discovery – Clustering: detect natural groupings – Outlier detection: detect anomalies – Affinity analysis: co-occurrence patterns • Prediction – Classification: predict a category – Regression: predict a value – Recommendation: predict a preference • Big Data Science: high energy physics, genomics, etc.
  • 16. Use-case: product recommendation • Inputs: – Explicit product ratings (when provided) – Implicit information: purchase transactions, page views, comments • Data sources: ratings, page views, forum comments • Ratings matrix (movies: Epic, X-Men, Hobbit, Argo, Pirates; users: U101, U102, U103, U104, U105, …; “?” marks an unknown rating): 5 2 4 ? ? ? ? 5 2 ? 1 2 ? ? 3 ? 2 3 1 5
  • 17. Goal: predict a preference • Observed ratings matrix (movies: Epic, X-Men, Hobbit, Argo, Pirates; users: U101, U102, U103, U104, U105, …): 5 2 4 ? ? ? ? 5 2 ? 1 2 ? ? 3 ? 2 3 1 5 • Same matrix with the unknown ratings predicted: 5 2 4 1 3 4 1 5 2 3 1 2 4 1 3 3 2 3 1 5
  • 18. Using Hadoop for recommendation • Diagram: data sources (transactions, page views, content); HDFS and MapReduce; pre-process and recommend steps (SQL, custom logic); online serving • With Hadoop, we can process very large preference datasets
  • 19. Use-case: failure prediction • Inputs: – Equipment history: install date, model, past issues – Equipment sensor data – Product catalog: product families, expected lifetime • Data sources: history, sensor data, product catalog
      SKU    | Install date | Service person ID | Zip code | Avg temp | TTF (days)
      113454 | 5/1/2011     | 1345              | 94002    | 72       | 180
      998323 | 5/3/2009     | 3234              | 88321    | 68       | 450
      345375 | 8/2/2005     | 1112              | 53323    | 82       | 332
      …      | …            | …                 | …        |          |
  • 20. Building a prediction model • A model is built from the labeled data, then applied to unseen data to predict TTF
      Labeled data:
      SKU    | Install date | Service person ID | Zip code | Avg temp | TTF (days)
      113454 | 5/1/2011     | 1345              | 94002    | 72       | 180
      998323 | 5/3/2009     | 3234              | 88321    | 68       | 450
      345375 | 8/2/2005     | 1112              | 53323    | 82       | 332
      …      | …            | …                 | …        |          |
      Unseen data:
      SKU    | Install date | Service person ID | Zip code | Avg temp
      332456 | 3/3/2013     | 1345              | 94005    | 71
      442343 | 6/6/2013     | 1112              | 77485    | 67
  • 21. Using Hadoop for failure prediction • HDFS: central repository for all data – Service records (Word, PDF, etc.) – Equipment purchase transaction data – Product catalog: SKUs, model numbers, etc. • Pre-process – Convert service records to item features: remove PDF formatting, detect entities in records – Normalize data using service records, product catalog – Create feature matrix; ready for modeling algorithm
  • 22. Use-case: SaaS application security • Inputs: – Click-stream: user interaction with application • Data sources: user data, clicks
      User ID | User since | Logins/month | Avg DL KB/day | …
      123456  | 1/3/2004   | 6            | 30            |
      998323  | 5/3/2009   | 1            | 5             |
      345375  | 8/2/2005   | 22           | 120           |
      …       | …          | …            | …             |
  • 23. Detecting anomalous behavior records • User access profile modeled as vector of features • Detect anomalies in application access patterns – Rules based – Machine learning based (determine “outlier factor”: 0…1)
  • 24. Using Hadoop for anomaly detection • HDFS: central repository for all raw data – Raw user-access logs – User information (organization, demographics) • Pre-process – Build access-profile (behavioral) for each user • Detect anomalies – In Hadoop – Using existing tools: R, SAS, rules engine, etc.
  • 25. How do I get started?
  • 26. Getting started with data science on Hadoop 1. Pick a good use-case that delivers immediate business value 2. Implement a proof-of-value (POV) 3. Build a team (hire/train)
  • 27. Implement a proof-of-value • Put together a Hadoop cluster • Define the POV business use-case • Pull the raw data you need into the cluster • Build it • Show the business value of your data assets • Contact us. We can help!
  • 28. Build a team: the data scientist skillset continuum • Roles shown on the continuum: software engineer, data engineer, data scientist, applied scientist, research scientist
      Role              | Function                                                                               | Good at…
      Data Engineer     | Builds production-grade data products                                                  | Data and systems architecture; Hadoop, Pig/Hive, MapReduce, Mahout; Java, Python, Perl, SQL, C++, etc.; NoSQL (HBase, Cassandra, Mongo)
      Applied Scientist | Finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm | Statistics, machine learning; text processing, NLP; R, Matlab, SAS, SQL; scripting, prototyping; visualization / telling the story
  • 29. Thank you! Any questions? Ofer Mendelevitch, Director, Data Sciences @ Hortonworks • ofer@hortonworks.com • @ofermend • We’re hiring! • Data Science training: www.hortonworks.com/training
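
Slide 12 lists term normalization and document vector generation among the pre-processing steps. The deck does not show how they are implemented; the following is only a minimal single-machine sketch in Python, assuming HTML input, lowercase alphanumeric tokens and plain term counts as the vector. The names `normalize` and `document_vector` are hypothetical. On a cluster the same per-document function would typically run inside a map step.

```python
import re
from collections import Counter

# Assumption: documents are HTML fragments; tags are dropped with a simple regex,
# which is good enough for a sketch but not a full HTML parser.
TAG = re.compile(r"<[^>]+>")
TOKEN = re.compile(r"[a-z0-9]+")


def normalize(raw):
    """Strip tags, lowercase the text, and split it into alphanumeric tokens."""
    text = TAG.sub(" ", raw)
    return TOKEN.findall(text.lower())


def document_vector(raw):
    """Bag-of-words vector: term -> number of occurrences in the document."""
    return Counter(normalize(raw))


if __name__ == "__main__":
    print(document_vector("<p>Hadoop and <b>hadoop</b> HDFS</p>"))
```

Term counts are the simplest possible document vector; weighting schemes such as TF-IDF would be a natural next step but are not something the slides specify.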
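
Slide 17 states the recommendation goal as predicting a missing preference from the known ratings. The deck does not name an algorithm, so the sketch below swaps in one common choice, user-based collaborative filtering with cosine similarity, purely as an illustration. The ratings are hypothetical and are not the matrix from the slides; `cosine` and `predict` are illustrative names.

```python
from math import sqrt

# Hypothetical ratings (user -> {item: rating}); NOT the values from the slides.
RATINGS = {
    "U101": {"Epic": 5, "X-Men": 2, "Hobbit": 4},
    "U102": {"Hobbit": 5, "Argo": 2},
    "U103": {"Epic": 1, "X-Men": 2, "Pirates": 3},
    "U104": {"X-Men": 2, "Hobbit": 3, "Argo": 1, "Pirates": 5},
}


def cosine(a, b):
    """Cosine similarity of two rating dicts, treating unrated items as 0."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no similar user has rated the item.
    """
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        weight = cosine(ratings[user], theirs)
        num += weight * theirs[item]
        den += weight
    return num / den if den else None


if __name__ == "__main__":
    print(predict(RATINGS, "U101", "Argo"))
```

Because the weights are non-negative, the prediction always falls between the lowest and highest rating that similar users gave the item. At Hadoop scale the pairwise similarity computation is the expensive part and is the piece that would be distributed.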
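
Slide 20 describes fitting a model on labeled rows and applying it to unseen rows to predict TTF. As a minimal sketch of that train-then-predict flow, the Python below fits a one-variable least-squares line. Using average temperature as the only feature is an assumption made for brevity, and the three labeled rows from the slide are far too few for a real model; `fit_line` and `predict` are hypothetical names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Apply the fitted line to one unseen feature value."""
    return slope * x + intercept


if __name__ == "__main__":
    # Labeled data from slide 20: (avg temp, TTF in days).
    temps = [72, 68, 82]
    ttf_days = [180, 450, 332]
    slope, intercept = fit_line(temps, ttf_days)

    # Unseen rows from slide 20 carry avg temps 71 and 67 and no TTF.
    for temp in (71, 67):
        print(temp, round(predict(slope, intercept, temp)))
```

A realistic model would use the full feature matrix that slide 21 describes (service records, purchase data, product catalog) rather than a single column.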
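
Slide 23 mentions a machine-learning-based “outlier factor” between 0 and 1 computed from each user's feature vector, without saying how it is derived. One simple way to get such a score, shown here only as an assumption-laden sketch, is to standardize each feature, take the root-mean-square z-score, and squash it into [0, 1). The profiles below are hypothetical, and `outlier_factors` is an illustrative name.

```python
from math import sqrt
from statistics import mean, pstdev


def outlier_factors(profiles):
    """Map each user's feature vector to a score in [0, 1).

    Scores near 0 mean the profile is close to the population average;
    scores near 1 mean it is far from it on at least one feature.
    """
    columns = list(zip(*profiles.values()))
    mus = [mean(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]

    scores = {}
    for user, vec in profiles.items():
        z = [(v - m) / s for v, m, s in zip(vec, mus, sigmas)]
        dist = sqrt(sum(x * x for x in z) / len(z))  # RMS z-score
        scores[user] = dist / (1.0 + dist)           # squash into [0, 1)
    return scores


if __name__ == "__main__":
    # Hypothetical access profiles: (logins per month, avg download KB per day).
    profiles = {
        "u1": (6, 30),
        "u2": (5, 28),
        "u3": (7, 35),
        "u4": (6, 31),
        "u5": (40, 900),
    }
    for user, score in sorted(outlier_factors(profiles).items()):
        print(user, round(score, 2))
```

A threshold on this score plays the role of the rules-based alternative on the same slide; the profile-building step (slide 24) is the part that benefits from running on Hadoop.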

Editor's Notes

  1. Data science is not new. But now we need to do it with much larger datasets.
  2. As the volume of data has exploded, organizations increasingly acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop, and HDP in particular, being introduced as a complement to the traditional approaches. It does not replace the database, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with: existing applications, such as Tableau, SAS, Business Objects, etc.; existing databases and data warehouses, for loading data to and from the data warehouse; development tools used for building custom applications; and operational tools for managing and monitoring.