© Hortonworks Inc. 2013
Big Data, Data Science & Hadoop
Ofer Mendelevitch
San Francisco Bay Area Microsoft Business Intelligence User Group
May 2013
Who am I?
Director of Data Sciences @ Hortonworks
• Data science with Hadoop
• Professional services
Previously…
A Chess Dad
Gartner’s 3 V’s of big data:
• Volume: size of the data
• Velocity: ingest speed, response latency
• Variety: diverse sources; format, structure; data quality
What Makes Up Big Data?
Transactions + Interactions + Observations = BIG DATA
Data variety and complexity increase as volume grows from megabytes to petabytes:
• ERP (megabytes): purchase detail, purchase record, payment record
• CRM (gigabytes): offer details, support contacts, customer touches, segmentation
• WEB (terabytes): web logs, offer history, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels
• BIG DATA (petabytes): user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video, audio, images, speech to text, product/service logs, social interactions & feeds, business data feeds, user click stream, sensors / RFID / devices, spatial & GPS coordinates
Where is all this data coming from?
• Sensors/devices
• Online: social, forums, etc.
• Event logs
• Etc.
But also:
• Data that was previously “thrown away”
OK… so what is big data?
I like a quote from Michael Franklin (UCB):
“Big Data is any data that is expensive to manage and hard to extract value from.”
It’s a relative term.
Today’s big data may be tomorrow’s small data.
What is a data product?
“A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”
Example 1: Google AdWords
Example 2: People you may know
Example 3: Spell correction
What is data science?
#1: Extracting deep meaning from data
(data mining; finding “gems” in data)
What is data science?
#2: Building data products
(Delivering gems on a regular basis)
Pre-process Build model SQL
Periodic batch processing
Online serving
Common data science tasks
Descriptive
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns
Predictive
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference
A brief history of Apache Hadoop
Timeline (2004 to 2013):
• 2005, focus on INNOVATION: Yahoo! creates team under E14 to work on Hadoop
• 2008, focus on OPERATIONS: Yahoo! team extends focus to operations to support multiple projects & growing clusters
• 2011, focus on STABILITY: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo!
Milestones along the way: Apache project established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform
HDP: Enterprise-Ready Hadoop
HORTONWORKS DATA PLATFORM (HDP)
• Hadoop core (distributed storage & processing): HDFS, MapReduce
• Platform services (enterprise readiness): HA, DR, snapshots, security, …
• Data services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
• Operational services (manage & operate at scale): Oozie, Ambari
• OS / VM, cloud, appliance
Core Hadoop: HDFS & MapReduce
Deliver high-scale storage & processing
• HDFS: distributed, self-healing data store
• MapReduce: distributed computation framework that handles the complexities of distributed programming
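To make the map-reduce model above concrete, here is a minimal word-count sketch simulated in a single Python process. It is not Hadoop code: the names map_phase, shuffle, reduce_phase and word_count are illustrative assumptions, and on a real cluster the framework would perform the shuffle and run many mappers and reducers in parallel across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all counts emitted for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

The programmer writes only the map and reduce functions; partitioning the input, moving data between phases and retrying failed tasks are the parts the framework takes over.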
Keys to Hadoop’s power
• Computation co-located with data
– Data and computation system co-designed and co-developed to work together
• Process data in parallel across thousands of “commodity” hardware nodes
– Self-healing; failure handled by software
• Designed for one write and multiple reads
– There are no random writes
– Optimized for minimum seek on hard drives
Inside HDP for Windows
Hortonworks Data Platform (HDP) for Windows
• 100% open source Enterprise Hadoop
• Component and version compatible with Microsoft HDInsight
• Availability
– Beta release available now
– GA early 2Q 2012
HORTONWORKS DATA PLATFORM (HDP) for Windows
• Hadoop core (distributed storage & processing): HDFS, WebHDFS, MapReduce
• Platform services
• Data services (store, process and access data): HCatalog, Hive, Pig, Sqoop
• Operational services (manage & operate at scale): Oozie
Seamless Interoperability with Your Microsoft Tools
• Integrated with Microsoft tools for native big data analysis
– Bi-directional connectors for SQL Server and SQL Azure through Sqoop
– Excel ODBC integration through Hive
• Addressing demand for Hadoop on Windows
– Ideal for Windows customers with Hadoop operational experience
• Enables all common Hadoop workloads
– Data refinement and ETL offload for high-volume data landing
– Data exploration for discovery of new business opportunities
[Diagram layers: Applications (Microsoft applications); Data systems (Hortonworks Data Platform for Windows); Data sources (mobile data; OLTP, POS systems; traditional sources: RDBMS, OLTP, OLAP; new sources: web logs, email, sensor data, social media)]
Data Science, now with more data…
Benefits of Hadoop for data science
Benefit #1: Explore full datasets
Explore large datasets directly with Hadoop
[Diagram: analysis cycle of Acquire, Clean Data, Visualize/Grok, Model, Measure/Evaluate. The full dataset is stored on Hadoop; the researcher works from a laptop with R, Matlab, SAS, etc.]
Integrate Hadoop in your data analysis flow
• Full dataset resides in Hadoop
• Typical Hadoop tasks:
– Simple statistics: mean, median, correlation
– Text pre-processing: grep, regex, NLP
– Dimensionality reduction: PCA, SVD, clustering, etc.
– Random sampling: with or without replacement, by unique
– K-fold cross-validation
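Two of the tasks listed above, random sampling without replacement and K-fold cross-validation, can be sketched in plain Python. This is an illustration only, assuming the records arrive as a stream too large to hold in memory; reservoir_sample and kfold_splits are hypothetical helper names, not functions from Hadoop or any library, and reservoir sampling is one common technique rather than the method used in the talk.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Uniform sample of k records without replacement, in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # randint is inclusive on both ends
            if j < k:
                sample[j] = record     # keep record with probability k / (i + 1)
    return sample


def kfold_splits(n_records, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    folds = [list(range(fold, n_records, k)) for fold in range(k)]
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


if __name__ == "__main__":
    print(reservoir_sample(range(1000), 5, seed=1))
    for train, test in kfold_splits(6, 3):
        print(train, test)
```

Because the sampler reads each record once and keeps only k of them, the same logic fits a pass over a large file; the fold assignment by index modulo k is a simple choice that assumes the records are not ordered in a way that biases the folds.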
Benefits of Hadoop for data science
Benefit #2: Mine larger datasets
More data -> better outcomes
Banko & Brill, 2001
Halevy, Norvig & Pereira, 2009
Learning algorithms with large datasets…
Challenges:
• Data won’t fit in memory
• Learning takes a lot longer…
Using Hadoop:
• Distribute data across nodes in the Hadoop cluster
• Implement a distributed/parallel algorithm
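As a small illustration of the "distribute the data, then run a parallel algorithm" idea above, the sketch below computes a mean and variance from per-partition summaries. It assumes each inner list stands in for the data held on one node; partial_stats, merge_stats and mean_and_variance are illustrative names, not part of Hadoop.

```python
from functools import reduce


def partial_stats(partition):
    """Per-node step: count, sum and sum of squares for one partition."""
    n = len(partition)
    total = float(sum(partition))
    total_sq = float(sum(x * x for x in partition))
    return n, total, total_sq


def merge_stats(a, b):
    """Combine two partial summaries; order of merging does not matter."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def mean_and_variance(partitions):
    """Merge all partial summaries, then derive mean and population variance."""
    n, total, total_sq = reduce(
        merge_stats, map(partial_stats, partitions), (0, 0.0, 0.0)
    )
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance


if __name__ == "__main__":
    print(mean_and_variance([[1, 2], [3, 4, 5]]))
```

Only three numbers per partition cross the network, regardless of partition size. The sum-of-squares formula is chosen here for brevity; it can lose precision on large or badly scaled data, so a production job would likely use a more numerically stable merge.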
Benefits of Hadoop for data science
Benefit #3: Large-scale data preparation
80% of data science work is data preparation
Raw data becomes processed data through steps such as:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Sampling, filtering
• Joins
• Term normalization
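Three of the preparation steps above (stripping markup, term normalization and document vector generation) can be sketched in a few lines of Python. This is a deliberately simple illustration: the regular expressions and the function names strip_html, normalize_terms and document_vector are assumptions for this example, and real pipelines would typically use a proper HTML parser and a richer tokenizer.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")      # naive tag matcher, for illustration only
TOKEN_RE = re.compile(r"[a-z0-9]+")  # lowercase alphanumeric terms


def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return TAG_RE.sub(" ", text)


def normalize_terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


def document_vector(text):
    """Bag-of-words vector: term -> number of occurrences in the document."""
    return Counter(normalize_terms(strip_html(text)))


if __name__ == "__main__":
    print(document_vector("<p>Big DATA, big value</p>"))
```

Each function handles one document at a time with no shared state, which is what makes this kind of cleanup a natural fit for a batch job spread over many nodes.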
Hadoop is ideal for batch data preparation and cleanup of large datasets
Benefits of Hadoop for data science
Benefit #4: Accelerate data-driven innovation
Barriers to speed with traditional data architectures
• RDBMS uses “schema on write”; change is expensive
• High barrier for data-driven innovation
Timeline:
• Start: “I need new data”; schema change project begins
• 6 months: “Finally, we start collecting”
• 9 months: “Let me see… is it any good?”
“Schema on read” means faster time-to-innovation
• Hadoop uses “schema on read”
• Low barrier for data-driven innovation
Timeline:
• Start: “I need new data”; “Let’s just put it in a folder on HDFS”
• 3 months: “Let me see… is it any good?”
• 6 months: “My model is awesome!”
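The "schema on read" idea can be illustrated with a short Python sketch: raw records are stored exactly as they arrive, and each analysis applies its own schema when it reads them. Everything here is an assumption made for illustration, including the JSON-lines layout, the field names in RAW_EVENTS and the read_with_schema helper; it is not Hive or HCatalog code.

```python
import json

# Raw events kept as-is; the second record has a field the first one lacks.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": 340, "referrer": "ad"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> cast function) while reading raw lines."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


if __name__ == "__main__":
    # Two different "views" of the same stored data, with no rewrite of it.
    print(list(read_with_schema(RAW_EVENTS, {"user": str, "ms": float})))
    print(list(read_with_schema(RAW_EVENTS, {"page": str, "referrer": str})))
```

Adding the referrer field required no change to the stored data or to earlier readers, which is the low barrier the slide contrasts with a schema change project.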
Quick start: Hortonworks Sandbox
• What is it
– A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform
– A personal Hadoop environment
– An integrated learning environment with frequently and easily updated hands-on, step-by-step tutorials
• What it does
– Dramatically accelerates the process of learning Apache Hadoop
– Accelerates and validates the use of Hadoop within your unique data architecture
– Lets you use your own data to explore and investigate your use cases
• ZERO to big data in 15 minutes
Download Hortonworks Sandbox
www.hortonworks.com/sandbox
Sign up for Training for in-depth learning
hortonworks.com/hadoop-training/
Hadoop Summit
Architecting the Future of Big Data
• June 26-27, 2013, San Jose Convention Center
• Co-hosted by Hortonworks & Yahoo!
• Theme: Enabling the Next Generation Enterprise Data Platform
• 90+ sessions and 7 tracks
• Community-focused event
– Sessions selected by a conference committee
– Community Choice allowed the public to vote for sessions they want to see
• Pre-event training classes
– Apache Hadoop Essentials: A Technical Understanding for Business Users
– Understanding Microsoft HDInsight and Apache Hadoop
– Developing Solutions with Apache Hadoop – HDFS and MapReduce
– Applying Data Science using Apache Hadoop
• 10% discount code: 13DiscHUG10
hadoopsummit.org
Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
ofer@hortonworks.com
@ofermend, @hortonworks
We’re hiring!
