2. What is Big Data
› Top Down Approach to the Topic of Big Data
› Data Science and Data Scientists
› Days of Data Past – Part I
› Days of Data Past – Part II
› Days of Data Past – Part III
› Three V’s
› “Data Lake” Architecture
Who can use Big Data
› Individual Experience vs Collective Experience
› Business Cases
› Use Cases
How to Use Big Data
› Coming of Hadoop
› Evolution of Hadoop
› “Other than” HDFS
3. Data Science and Data Scientists
› Science of mining, extracting, analyzing,
modeling, and visualizing large data sets from
multiple sources
› “Data analyst, data artist”
› Knowledge of math, statistics, predictive
modeling, pattern recognition and learning, data
visualization, data warehousing, etc.
› From C.F. Jeff Wu to William S. Cleveland to
“Data Science Journal” and beyond
4. Days of Data Past - Part 1
› Relational databases and their impact
› Write-first schema (schema-on-write)
› ACID compliant
› Row-store technology
› Relationally structured data for smaller data sets
› Relatively inexpensive products
SQL Server, Oracle, etc.
› Highly available skill-set
› SQL language
Data manipulation (DML) – Insert, Select, Update, and Delete
Data definition (DDL) – Create, Alter, Truncate, and Drop
› Influenced LINQ (in .NET), JPQL (in Java), and similar query
layers in application programming
› Enterprise ready
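The DML and DDL statements listed above can be sketched end to end. This is a minimal, illustrative example using Python's built-in sqlite3 module as a stand-in relational database; the table and column names are assumptions, not from the slides.

```python
# Sketch of the SQL DDL/DML statements named above, run against an
# in-memory SQLite database. Table/column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data definition: Create
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# Data manipulation: Insert, Update, Delete, Select
cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Alice", 50000.0))
cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Bob", 60000.0))
cur.execute("UPDATE employees SET salary = salary + 5000 WHERE name = ?", ("Alice",))
cur.execute("DELETE FROM employees WHERE name = ?", ("Bob",))
rows = cur.execute("SELECT name, salary FROM employees").fetchall()
print(rows)  # [('Alice', 55000.0)]

# Data definition: Drop
cur.execute("DROP TABLE employees")
conn.close()
```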
5. Days of Data Past - Part II
› Enterprise Data Warehouses (EDW) and their impact
› Massively Parallel Processing (MPP) appliances – not all EDWs are
packaged as MPP appliances
› Column-store technology, faster and easier for BI – not all EDWs use
column-store
› Dimensionally structured data for large data sets
› Enterprise storage not commodity storage
› Expensive premium products
Teradata, Vertica, SQL Server PDW, etc.
Some major vendors offer commodity hardware at a low price to
customers
Some major vendors offer services in addition to products
› Demanding skill-set
› Enterprise ready
6. Days of Data Past - Part III
› NoSQL data stores and their impact
› Not relational and not ACID compliant
› Four types
Key-value stores (KV)
Document stores
Graph database stores
Wide column stores
› Relatively inexpensive products
› Commodity storage, not enterprise storage
› Demanding and scarce skill-set
› Not Enterprise ready
NewSQL data stores as an alternative to NoSQL
Relational and ACID compliant
SQL driven, so existing SQL investments remain intact
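The key-value store type above is the simplest of the four to sketch. This toy in-memory store is purely illustrative (the class and method names are assumptions, not a real product): values are opaque and schemaless, and there are no multi-key transactions, which is why such stores are not ACID compliant across keys.

```python
# Toy in-memory key-value store sketching the NoSQL KV model:
# opaque values, no fixed schema, no cross-key transactions.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values are opaque to the store; no schema is enforced.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user:42", {"name": "Alice"})   # a document-like value
store.put("page:home:views", 1024)        # a plain counter value
print(store.get("user:42")["name"])       # Alice
```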
7. Three V’s
› Volume
Large volumes – 100 TB or more currently
Expected to exceed this benchmark in the future
› Velocity
How quickly data accumulates
How quickly your data makes sense
Batch, near-time, real-time
Batch vs Interactive
› Variety
Various data sources
Structured data – relational, ERP, CRM
Semi-structured data – click streams, weblogs, geographical,
social
Unstructured data – sensor, textual, machine generated
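The "variety" bullet above can be made concrete: semi-structured weblog lines carry structure, but it has to be extracted at processing time. A small sketch, assuming a log layout roughly like the Common Log Format (the field names and sample line are illustrative):

```python
# Turning a semi-structured weblog line into a structured record.
# The pattern assumes a Common-Log-Format-like layout.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'
)

def parse_line(line):
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

line = '192.168.0.1 - - [10/Oct/2014:13:55:36 -0700] "GET /index.html HTTP/1.1" 200'
rec = parse_line(line)
print(rec["ip"], rec["method"], rec["status"])  # 192.168.0.1 GET 200
```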
8. “Data Lake” Architecture
› Modern Data Architecture
Provides a shared service for broad insight across a
large, diverse data set at efficient scale, according to
Hortonworks
A unified data architecture that integrates with
end-to-end enterprise solutions, according to Teradata
› Caters to 3V-driven big data opportunities
› Raw data of unrecognized value
› Read-first schema (schema-on-read)
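Read-first (schema-on-read) means raw records land in the lake untouched, and a schema is imposed only at query time. A minimal sketch, in which the record fields and the projection are assumptions for illustration:

```python
# Schema-on-read sketch: heterogeneous raw records are stored as-is
# ("raw data of unrecognized value"); a schema is applied at read time.
import json

raw_lines = [
    '{"user": "alice", "event": "click", "page": "/home"}',
    '{"user": "bob", "event": "purchase", "amount": 19.99}',
    '{"sensor": "t-17", "reading": 21.4}',
]

def read_with_schema(lines, fields):
    """Project the requested fields at read time,
    filling None where a record lacks them."""
    for line in lines:
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

events = list(read_with_schema(raw_lines, ["user", "event"]))
print(events[0])  # {'user': 'alice', 'event': 'click'}
```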
9. Individual Experience vs Collective Experience
› Need to treat customers as individuals instead of as a mass collective
› Predictive modeling to recommend an individual’s best
“intent”
› Implementing the Process Communication Model (PCM) to
give better individualized customer service
Listening to a particular song by a particular artist via mobile
Calling a call center
› Privacy concerns – a main obstacle in the current big data trend
10. Business Cases
› Medical or Healthcare
› Entertainment
› Forensics
› Financial
› Retail
11. Use Cases
› Medical or Healthcare
Find a cure for a disease based on an individual’s medical history,
behavior patterns, food and drug consumption, and similar
patients’ data
› Entertainment
Provide a recommendation engine, as in IMDb or Netflix, based on an
individual’s viewing patterns
› Forensics
Identify a serial killer from historical murder data, as on CSI, and
similarly prevent further incidents that fit the killer’s pattern
› Financial
Provide a predictive financial model for Wall Street stock market
fluctuations based on historical shareholder patterns
› Retail
12. Coming of Hadoop
› GFS and Google’s MapReduce engine, and the
publishing of their white papers by Google
› The Yahoo team, the first to decode the white papers
and create HDFS and an MR engine to scale out Yahoo
search
› Creation of Hadoop 1.0 (Generation 1) in 2006, and
Yahoo’s commitment to production-level Hadoop
› Spawning of the Hortonworks company in 2011 from a
set of Yahoo employees, and the move toward
enterprise hardening
› Spawning of multiple Hadoop distros as products
13. Evolution of Hadoop
› Hadoop 1.x (Generation 1)
Data Management – HDFS for redundant data storage from various sources and MapReduce
to process the data
Data Access Layer (batch, near-time, real-time) – to access data simultaneously in multiple
ways
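The MapReduce model named above can be sketched in a few lines: map emits key-value pairs, the framework shuffles them by key, and reduce aggregates each group. This is a toy, single-process word count with illustrative function names; it shows only the programming model, not HDFS's distribution or redundancy.

```python
# Toy single-process sketch of MapReduce: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(document):
    # Mapper: emit (word, 1) for each word in the document.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework does.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data lakes hold data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["data"])  # 3
```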
› Hadoop 2.x (Generation 2)
Introducing YARN in the Data Management layer
Governance and integration at the enterprise level – data loading, data policy enforcement, data
management – introducing Apache Falcon
Security – authentication and authorization in a layered, secure way – Apache Knox
Operations – deploy, monitor, and manage the platform as a whole – introducing Apache Ambari
› Enterprise Hadoop
Deployment choice – physical, virtual, or cloud; Windows or Linux; Hortonworks, Cloudera, or
another distro product
Presentation and Applications – Enable existing and new applications to generate value from
Hadoop
Enterprise management and security – empower existing proven enterprise tools to integrate
with Hadoop
Services or product choice – YARN-enabling always-on, long-running services with Apache
Slider
15. “Other than” Hadoop, HDFS
› HDFS-like storage systems with similar
MapReduce engines
› MapR (uses its own file system, exposed via NFS)
Has cloud support too
› EMC, NetApp, Cleversafe, Symantec
› IBM’s BigInsights (a kind of distro of Cloudera, which
is in turn a distro of Hadoop)
› SAP’s HANA suite
› Of course, Google’s proprietary GFS, on which HDFS
was originally based
Editor's Notes
In November 1997, C.F. Jeff Wu gave the inaugural lecture entitled "Statistics = Data Science?"
In 2001, William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to incorporate "advances in computing with data" in his article "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics"
In April 2002, the International Council for Science: Committee on Data for Science and Technology (CODATA)[9] started the Data Science Journal
ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably