SlideShare a Scribd company logo
Practical Introduction to
Hadoop
Alex Gorbachev
30-May-2013
© 2012 – Pythian
Who is Pythian?
15 Years of Data
infrastructure management
consulting
170+ Top brands
6000+ databases under
management
Over 200 DBA’s, in 26
countries
Top 5% of DBA work force, 9
Oracle ACE’s, 2 Microsoft
MVP’s
Oracle, Microsoft, MySQL,
Netezza, Hadoop, MongoDB,
Oracle Apps, Enterprise
Infrastructure
© 2012 – Pythian
When to Engage Pythian?
Tier 3 Data
Local Retailer
Strategic upside value
from data
Tier 2 Data
eCommerce
Tier 1 Data
Health Care
Profit
Loss
Impact of an incident,
whether it be data loss,
security, human error,
etc.
Value of Data
LOVE YOUR DATA
© 2012 – Pythian
4
© 2012 Pythian
Alex Gorbachev
• Chief Technology Officer at Pythian
• Blogger
• Cloudera Champion of Big Data
• OakTable Network member
• Oracle ACE Director
• Founder of BattleAgainstAnyGuess.com
• Founder of Sydney Oracle Meetup
• IOUG Director of Communities
• EVP, Ottawa Oracle User Grou
4
© 2012 – Pythian
Given enough skill and money –
relational databases can do anything.
Sometimes it’s just unfeasibly
expensive.
What is Hadoop?
© 2012 – Pythian
Hadoop Principles
Bring Code to Data Share Nothing
© 2012 – Pythian
Hadoop in a Nutshell
Replicated Distributed Big-
Data File System
Map-Reduce - framework for
writing massively parallel
jobs
© 2012 – Pythian
HDFS architecture
simplified view
• Files are split in large blocks
• Each block is replicated on
write
• A file can be only created and
deleted by one client
• Uploading new data? => new file
• Append supported in recent versions
• Update data? => recreate file
• No concurrent writes to a file
• Clients transfer blocks directly
to & from data nodes
• Data nodes use cheap local
disks
• Local reads are the most
efficient
© 2012 – Pythian
HDFS design principles
© 2012 – Pythian
Map Reduce example histogram
calculation
© 2012 – Pythian
Map Reduce pros & cons
Advantages
• Simple programming
paradigm
• Flexible
• Highly scalable
• Good fit for HDFS – mappers
read locally
• Fault tolerant
• Task failure or node failure
doesn’t affect the whole job –
they are restartable
Pitfalls
• Low efficiency
• Lots of intermediate data
• Lots of network traffic on
shuffle
• Complex manipulation
requires pipeline of multiple
jobs
• No high-level language
• Only mappers leverage
local reads on HDFS
© 2012 – Pythian
Some components of Hadoop
ecosystem
• Hive – HiveQL is SQL like query language
• Generates MapReduce jobs
• Pig – data sets manipulation language (like create your
own query execution plan)
• Generates MapReduce jobs
• Mahout – machine learning libraries
• Generates MapReduce jobs
• Oozie – workflow scheduler services
• Sqoop – transfer data between Hadoop and relational
database
© 2012 – Pythian
MapReduce is SLOW!
Speed through massive
parallelization
© 2012 – Pythian
Non-MR processing on Hadoop
• HBase – columnar-oriented key-value store (NoSQL)
• SQL without Map Reduce
• Impala (Cloudera)
• Drill (MapR)
• Phoenix (Salesforce.com)
• Hadapt (commercial)
• Shark – Spark in-memory analytics on Hadoop
• Platfora (commercial)
• In-memory analytics + visualization & reporting tool
© 2012 – Pythian
Hadoop Benefits
• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to leverage parallelism
• Designed to scale
• Flexible development platform
• Solution Ecosystem
© 2012 – Pythian
• Hadoop is scalable but not fast
• Some assembly required
• Batteries not included
• Instrumentation not included
either
• DIY mindset (remember
MySQL?)
• Commercial distributions are
not free
• Simplistic security models
Hadoop Limitations
© 2012 – Pythian
How much does it cost?
$300K DIY on SuperMicro
• 100 data nodes
• 2 name nodes
• 3 racks
• 800 Sandy Bridge CPU
cores
• 6.4 TB RAM
• 600 x 2TB disks
• 1.2 PB of raw disk capacity
• 400 TB usable (triple
mirror)
• Open-source s/w, maybe
commercial distribution
© 2012 – Pythian
Hadoop vs. Relational Database
Cool
Load first, structure later
“Cheap” hardware
DIY
Flexible data store
Effectiveness via scale
Petabytes
100s – 1000s cluster nodes
⇔Old
⇔Structure first, load later
⇔Enterprise grade hardware
⇔Repeatable solutions
⇔Efficient data store
⇔Effectiveness via efficiency
⇔Terabytes
⇔Dozens of nodes (maybe)
Big Data and Hadoop Use Cases
© 2012 – Pythian
Use Cases for Big Data
• Top-line contributors
• Analyze customer behavior
• Optimize ad placements
• Customized promotions and etc
• Recommendation systems
• Netflix, Pandora, Amazon
• New products and services
• Prismatic, smart home
© 2012 – Pythian
© 2012 – Pythian
Use Cases for Big Data
• Top-line contributors
• Analyze customer behavior
• Optimize ad placements
• Customized promotions and etc
• Recommendation systems
• Netflix, Pandora, Amazon
• New products and services
• Prismatic, smart home
© 2012 – Pythian
Use Cases for Big Data
• Bottom-line contributors
• Cheap archives storage
• ETL layer – transformation engine, data cleansing
•
•
•
•
•
•
•
•
© 2012 – Pythian
Typical Initial Use-Cases for
Hadoop in modern Enterprise IT
• Transformation engine (part of ETL)
• Scales easily
• Inexpensive processing capacity
• Any data source and destination
• Data Landfill
• Stop throwing away any data
• Don’t know how to use data today? Maybe tomorrow you will
• Hadoop is very inexpensive but very reliable
© 2012 – Pythian
Advanced: Data Science
Platform
• Data warehouse is good when questions are known, data
domain and structure is defined
• Hadoop is great for seeking new meaning of data, new
types of insights
• Unique information parsing and interpretation
• Huge variety of data sources and domains
• When new insights are found and
new structure defined, Hadoop often
takes place of ETL engine
• Newly structured information is then
loaded to more traditional data-
warehouses (still today)
© 2012 – Pythian
Pythian Internal Hadoop Use
• OCR of screen video capture from Pythian privileged
access surveillance system
• Input raw frames from video capture
• Map-Reduce job runs OCR on frames and produces text
• Map-Reduce job identifies text changes from frame to frame and
produces text stream with timestamp when it was on the screen
• Other Map-Reduce jobs mine text (and keystrokes) for insights
• Credit Cart patterns
• Sensitive commands (like DROP TABLE)
• Root access
• Unusual activity patterns
• Merge with monitoring and documentation systems
© 2012 – Pythian
2
Thank you & Q&A
http://www.pythian.com/blog/
http://www.facebook.com/pages/The-Pythian-Group/
http://twitter.com/pythian http://twitter.com/alexgorbachev
http://www.linkedin.com/company/pythian
1-866-PYTHIAN
sales@pythian.com gorbachev@pythian.com
To contact us…
To follow us…

More Related Content

What's hot

Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
Naresh Rupareliya
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
Edureka!
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Vadim Y. Bichutskiy
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
Jaemun Jung
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Edureka!
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
Aakashdata
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
Russell Jurney
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Yahoo Developer Network
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
Databricks
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
Mark Kromer
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Databricks
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 

What's hot (20)

Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 

Similar to Practical introduction to hadoop

Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
Gwen (Chen) Shapira
 
Bi 2.0 hadoop everywhere
Bi 2.0   hadoop everywhereBi 2.0   hadoop everywhere
Bi 2.0 hadoop everywhere
Dmitry Tolpeko
 
Data lake – On Premise VS Cloud
Data lake – On Premise VS CloudData lake – On Premise VS Cloud
Data lake – On Premise VS Cloud
Idan Tohami
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
solarisyougood
 
Presentation big dataappliance-overview_oow_v3
Presentation   big dataappliance-overview_oow_v3Presentation   big dataappliance-overview_oow_v3
Presentation big dataappliance-overview_oow_v3
xKinAnx
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
DataWorks Summit
 
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Alex Gorbachev
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
Rittman Analytics
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.
 
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy ModernizationMove to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
DataWorks Summit
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
Seeling Cheung
 
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopBig Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Caserta
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
Inside Analysis
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
Learntek1
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
DataWorks Summit
 
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Eric Baldeschwieler
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
Caserta
 

Similar to Practical introduction to hadoop (20)

Integrated dwh 3
Integrated dwh 3Integrated dwh 3
Integrated dwh 3
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
 
Bi 2.0 hadoop everywhere
Bi 2.0   hadoop everywhereBi 2.0   hadoop everywhere
Bi 2.0 hadoop everywhere
 
Data lake – On Premise VS Cloud
Data lake – On Premise VS CloudData lake – On Premise VS Cloud
Data lake – On Premise VS Cloud
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
Presentation big dataappliance-overview_oow_v3
Presentation   big dataappliance-overview_oow_v3Presentation   big dataappliance-overview_oow_v3
Presentation big dataappliance-overview_oow_v3
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy ModernizationMove to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
 
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopBig Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
Beyond TCO
Beyond TCOBeyond TCO
Beyond TCO
 
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 

More from inside-BigData.com

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
inside-BigData.com
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
inside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
inside-BigData.com
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
inside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
inside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 

Recently uploaded (20)

Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 

Practical introduction to hadoop

  • 2. © 2012 – Pythian Who is Pythian? 15 Years of Data infrastructure management consulting 170+ Top brands 6000+ databases under management Over 200 DBA’s, in 26 countries Top 5% of DBA work force, 9 Oracle ACE’s, 2 Microsoft MVP’s Oracle, Microsoft, MySQL, Netezza, Hadoop, MongoDB, Oracle Apps, Enterprise Infrastructure
  • 3. © 2012 – Pythian When to Engage Pythian? Tier 3 Data Local Retailer Strategic upside value from data Tier 2 Data eCommerce Tier 1 Data Health Care Profit Loss Impact of an incident, whether it be data loss, security, human error, etc. Value of Data LOVE YOUR DATA
  • 4. © 2012 – Pythian 4 © 2012 Pythian Alex Gorbachev • Chief Technology Officer at Pythian • Blogger • Cloudera Champion of Big Data • OakTable Network member • Oracle ACE Director • Founder of BattleAgainstAnyGuess.com • Founder of Sydney Oracle Meetup • IOUG Director of Communities • EVP, Ottawa Oracle User Grou 4
  • 5. © 2012 – Pythian Given enough skill and money – relational databases can do anything. Sometimes it’s just unfeasibly expensive.
  • 7. © 2012 – Pythian Hadoop Principles Bring Code to Data Share Nothing
  • 8. © 2012 – Pythian Hadoop in a Nutshell Replicated Distributed Big- Data File System Map-Reduce - framework for writing massively parallel jobs
  • 9. © 2012 – Pythian HDFS architecture simplified view • Files are split in large blocks • Each block is replicated on write • A file can be only created and deleted by one client • Uploading new data? => new file • Append supported in recent versions • Update data? => recreate file • No concurrent writes to a file • Clients transfer blocks directly to & from data nodes • Data nodes use cheap local disks • Local reads are the most efficient
  • 10. © 2012 – Pythian HDFS design principles
  • 11. © 2012 – Pythian Map Reduce example histogram calculation
  • 12. © 2012 – Pythian Map Reduce pros & cons Advantages • Simple programming paradigm • Flexible • Highly scalable • Good fit for HDFS – mappers read locally • Fault tolerant • Task failure or node failure doesn’t affect the whole job – they are restartable Pitfalls • Low efficiency • Lots of intermediate data • Lots of network traffic on shuffle • Complex manipulation requires pipeline of multiple jobs • No high-level language • Only mappers leverage local reads on HDFS
  • 13. © 2012 – Pythian Some components of Hadoop ecosystem • Hive – HiveQL is SQL like query language • Generates MapReduce jobs • Pig – data sets manipulation language (like create your own query execution plan) • Generates MapReduce jobs • Mahout – machine learning libraries • Generates MapReduce jobs • Oozie – workflow scheduler services • Sqoop – transfer data between Hadoop and relational database
  • 14. © 2012 – Pythian MapReduce is SLOW! Speed through massive parallelization
  • 15. © 2012 – Pythian Non-MR processing on Hadoop • HBase – columnar-oriented key-value store (NoSQL) • SQL without Map Reduce • Impala (Cloudera) • Drill (MapR) • Phoenix (Salesforce.com) • Hadapt (commercial) • Shark – Spark in-memory analytics on Hadoop • Platfora (commercial) • In-memory analytics + visualization & reporting tool
  • 16. © 2012 – Pythian Hadoop Benefits • Reliable solution based on unreliable hardware • Designed for large files • Load data first, structure later • Designed to maximize throughput of large scans • Designed to leverage parallelism • Designed to scale • Flexible development platform • Solution Ecosystem
  • 17. © 2012 – Pythian • Hadoop is scalable but not fast • Some assembly required • Batteries not included • Instrumentation not included either • DIY mindset (remember MySQL?) • Commercial distributions are not free • Simplistic security models Hadoop Limitations
  • 18. © 2012 – Pythian How much does it cost? $300K DIY on SuperMicro • 100 data nodes • 2 name nodes • 3 racks • 800 Sandy Bridge CPU cores • 6.4 TB RAM • 600 x 2TB disks • 1.2 PB of raw disk capacity • 400 TB usable (triple mirror) • Open-source s/w, maybe commercial distribution
  • 19. © 2012 – Pythian Hadoop vs. Relational Database Cool Load first, structure later “Cheap” hardware DIY Flexible data store Effectiveness via scale Petabytes 100s – 1000s cluster nodes ⇔Old ⇔Structure first, load later ⇔Enterprise grade hardware ⇔Repeatable solutions ⇔Efficient data store ⇔Effectiveness via efficiency ⇔Terabytes ⇔Dozens of nodes (maybe)
  • 20. Big Data and Hadoop Use Cases
  • 21. © 2012 – Pythian Use Cases for Big Data • Top-line contributors • Analyze customer behavior • Optimize ad placements • Customized promotions and etc • Recommendation systems • Netflix, Pandora, Amazon • New products and services • Prismatic, smart home
  • 22. © 2012 – Pythian
  • 23. © 2012 – Pythian Use Cases for Big Data • Top-line contributors • Analyze customer behavior • Optimize ad placements • Customized promotions and etc • Recommendation systems • Netflix, Pandora, Amazon • New products and services • Prismatic, smart home
  • 24. © 2012 – Pythian Use Cases for Big Data • Bottom-line contributors • Cheap archives storage • ETL layer – transformation engine, data cleansing • • • • • • • •
  • 25. © 2012 – Pythian Typical Initial Use-Cases for Hadoop in modern Enterprise IT • Transformation engine (part of ETL) • Scales easily • Inexpensive processing capacity • Any data source and destination • Data Landfill • Stop throwing away any data • Don’t know how to use data today? Maybe tomorrow you will • Hadoop is very inexpensive but very reliable
  • 26. © 2012 – Pythian Advanced: Data Science Platform • Data warehouse is good when questions are known, data domain and structure is defined • Hadoop is great for seeking new meaning of data, new types of insights • Unique information parsing and interpretation • Huge variety of data sources and domains • When new insights are found and new structure defined, Hadoop often takes place of ETL engine • Newly structured information is then loaded to more traditional data- warehouses (still today)
  • 27. © 2012 – Pythian Pythian Internal Hadoop Use • OCR of screen video capture from Pythian privileged access surveillance system • Input raw frames from video capture • Map-Reduce job runs OCR on frames and produces text • Map-Reduce job identifies text changes from frame to frame and produces text stream with timestamp when it was on the screen • Other Map-Reduce jobs mine text (and keystrokes) for insights • Credit Cart patterns • Sensitive commands (like DROP TABLE) • Root access • Unusual activity patterns • Merge with monitoring and documentation systems
  • 28. © 2012 – Pythian 2 Thank you & Q&A http://www.pythian.com/blog/ http://www.facebook.com/pages/The-Pythian-Group/ http://twitter.com/pythian http://twitter.com/alexgorbachev http://www.linkedin.com/company/pythian 1-866-PYTHIAN sales@pythian.com gorbachev@pythian.com To contact us… To follow us…

Editor's Notes

  1. Big Data is not new. What’s new is that it’s not cost prohibitive anymore for broad commercial adoption.
  2. The ideas are simple:Data is big, code is small. It is far more efficient to move the small code to the big data than vice versa. It you have a pillow and a sofa in a room, you typically move the pillow to the sofa, and not vice versa. But many developers are too comfortable with the select-than-process anti-pattern. This principle is in place to help with the throughput challenges.2. Sharing is nice, but safe sharing of data typically mean locking, queueing, bottle necks and race conditions. It is notoriously difficult to get concurrency right, and even if you do – it is slower than the alternative. Hadoop works around the whole thing. This principle is in place to deal with parallel processing challenges
  3. Amazon EMR uses S3. MapR uses MapRFS.Many non-MR applications too.
  4. Default block size is 64M,You can place any data file in HDFS. Later processing can find the meaning in the data.
  5. Many projects fail because people imagine a very rosy image of Hadoop. They think they can just throw all the data there and it will magically and quickly become value. Such misguided expectations also happen with other platforms and doom other projects too. To be successful with Hadoop, we need to be realistic about it.
  6. Patterns – pregnant 16 y.o. example
  7. Women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc.
  8. Patterns – pregnant 16 y.o. example
  9. Patterns – pregnant 16 y.o. example