2. What is Big Data
› Top Down Approach to the Topic of Big Data
› Data Science and Data Scientists
› Days of Data Past – Part I
› Days of Data Past – Part II
› Days of Data Past – Part III
› Three V’s
› “Data Lake” Architecture
Who can use Big Data
› Individual Experience vs Collective Experience
› Business Cases
› Use Cases
How to Use Big Data
› Coming of Hadoop
› Evolution of Hadoop
› “Other than” HDFS
3. Data Science and Data Scientists
› Science of mining, extracting, analyzing,
modeling, and visualizing large data sets from
multiple sources
› “Data analyst, data artist”
› Knowledge of math, statistics, predictive
modeling, pattern recognition and learning, data
visualization, data warehousing, etc.
› From C.F. Jeff Wu to William S. Cleveland to
“Data Science Journal” and beyond
4. Days of Data Past - Part 1
› Relational databases and their impact
› Write-first schema (schema-on-write)
› ACID compliant
› Row-store technology
› Relationally structured data for smaller data sets
› Relatively inexpensive products
SQL Server, Oracle, etc.
› Highly available skill-set
› SQL language
Data manipulation (DML) – Insert, Select, Update, and Delete
Data definition (DDL) – Create, Alter, Truncate, and Drop
› Influenced LINQ (in .NET), JPQL (in Java), and similar query
layers in application programming
› Enterprise ready
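The DML and DDL statements listed above can be sketched end to end. This is a minimal, illustrative example using Python's built-in sqlite3 module as a stand-in relational database; the table and column names are assumptions, not from the slides.

```python
# Sketch of the SQL DDL/DML statements named above, run against an
# in-memory SQLite database. Table/column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data definition: Create
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# Data manipulation: Insert, Update, Delete, Select
cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Alice", 50000.0))
cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Bob", 60000.0))
cur.execute("UPDATE employees SET salary = salary + 5000 WHERE name = ?", ("Alice",))
cur.execute("DELETE FROM employees WHERE name = ?", ("Bob",))
rows = cur.execute("SELECT name, salary FROM employees").fetchall()
print(rows)  # [('Alice', 55000.0)]

# Data definition: Drop
cur.execute("DROP TABLE employees")
conn.close()
```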
5. Days of Data Past - Part II
› Enterprise Data Warehouses (EDW) and their impact
› Massively Parallel Processing (MPP) appliances – not all EDWs are
packaged as MPP appliances
› Column-store technology, faster and easier for BI – not all EDWs use
column-store
› Dimensionally structured data for large data sets
› Enterprise storage not commodity storage
› Expensive premium products
Teradata, Vertica, SQL Server PDW, etc.
Some major vendors offer commodity hardware at a low price to
customers
Some major vendors offer services in addition to products
› Demanding skill-set
› Enterprise ready
6. Days of Data Past - Part III
› NoSQL data stores and their impact
› Not relational and not ACID compliant
› Four types
Key-value stores (KV)
Document stores
Graph database stores
Wide column stores
› Relatively inexpensive products
› Commodity storage, not enterprise storage
› Demanding and scarce skill-set
› Not Enterprise ready
NewSQL data stores as an alternative to NoSQL
Relational and ACID compliant
SQL driven, so existing SQL investments remain intact
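The key-value store type above is the simplest of the four to sketch. This toy in-memory store is purely illustrative (the class and method names are assumptions, not a real product): values are opaque and schemaless, and there are no multi-key transactions, which is why such stores are not ACID compliant across keys.

```python
# Toy in-memory key-value store sketching the NoSQL KV model:
# opaque values, no fixed schema, no cross-key transactions.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values are opaque to the store; no schema is enforced.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user:42", {"name": "Alice"})   # a document-like value
store.put("page:home:views", 1024)        # a plain counter value
print(store.get("user:42")["name"])       # Alice
```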
7. Three V’s
› Volume
Large volumes – 100 TB or more currently
Expected to exceed this benchmark in the future
› Velocity
How quickly data accumulates
How quickly your data makes sense
Batch, near-time, real-time
Batch vs Interactive
› Variety
Various data sources
Structured data – relational, ERP, CRM
Semi-structured data – click streams, weblogs, geographical,
social
Unstructured data – sensor, textual, machine generated
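The "variety" bullet above can be made concrete: semi-structured weblog lines carry structure, but it has to be extracted at processing time. A small sketch, assuming a log layout roughly like the Common Log Format (the field names and sample line are illustrative):

```python
# Turning a semi-structured weblog line into a structured record.
# The pattern assumes a Common-Log-Format-like layout.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'
)

def parse_line(line):
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

line = '192.168.0.1 - - [10/Oct/2014:13:55:36 -0700] "GET /index.html HTTP/1.1" 200'
rec = parse_line(line)
print(rec["ip"], rec["method"], rec["status"])  # 192.168.0.1 GET 200
```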
8. “Data Lake” Architecture
› Modern Data Architecture
Provides a shared service for broad insight across a
large, diverse data set at efficient scale, according to
Hortonworks
A unified data architecture that integrates with
end-to-end enterprise solutions, according to Teradata
› Caters to 3V-driven big data opportunities
› Raw data of unrecognized value
› Read-first schema (schema-on-read)
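Read-first (schema-on-read) means raw records land in the lake untouched, and a schema is imposed only at query time. A minimal sketch, in which the record fields and the projection are assumptions for illustration:

```python
# Schema-on-read sketch: heterogeneous raw records are stored as-is
# ("raw data of unrecognized value"); a schema is applied at read time.
import json

raw_lines = [
    '{"user": "alice", "event": "click", "page": "/home"}',
    '{"user": "bob", "event": "purchase", "amount": 19.99}',
    '{"sensor": "t-17", "reading": 21.4}',
]

def read_with_schema(lines, fields):
    """Project the requested fields at read time,
    filling None where a record lacks them."""
    for line in lines:
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

events = list(read_with_schema(raw_lines, ["user", "event"]))
print(events[0])  # {'user': 'alice', 'event': 'click'}
```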
9. Individual Experience vs Collective Experience
› Need to treat customers as individuals instead of as a mass collective
› Predictive modeling to recommend an individual’s best
“intent”
› Implementing the Process Communication Model (PCM) to
give better individualized customer service
Listening to a particular song by a particular artist via mobile
Calling a call center
› Privacy concerns – a main obstacle in the current big data trend
10. Business Cases
› Medical or Healthcare
› Entertainment
› Forensics
› Financial
› Retail
11. Use Cases
› Medical or Healthcare
Find a cure for a disease based on an individual’s medical history,
behavior patterns, food and drug consumption, and similar
patients’ data
› Entertainment
Provide a recommendation engine, as in IMDb or Netflix, based on an
individual’s viewing patterns
› Forensics
Identify a serial killer from historical murder data, as on CSI, and
similarly prevent further incidents that fit the killer’s pattern
› Financial
Provide a predictive financial model for Wall Street stock market
fluctuations based on historical shareholder patterns
› Retail
12. Coming of Hadoop
› GFS and Google’s MapReduce engine, and the
publishing of their white papers by Google
› The Yahoo team, the first to decode the white papers
and create HDFS and an MR engine to scale out Yahoo
search
› Creation of Hadoop 1.0 (Generation 1) in 2006, and
Yahoo’s commitment to production-level Hadoop
› Spawning of the Hortonworks company in 2011 from a
set of Yahoo employees, and the move toward
enterprise hardening
› Spawning of multiple Hadoop distros as products
13. Evolution of Hadoop
› Hadoop 1.x (Generation 1)
Data Management – HDFS for redundant data storage from various sources and MapReduce
to process the data
Data Access Layer (batch, near-time, real-time) – to access data simultaneously in multiple
ways
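The MapReduce model named above can be sketched in a few lines: map emits key-value pairs, the framework shuffles them by key, and reduce aggregates each group. This is a toy, single-process word count with illustrative function names; it shows only the programming model, not HDFS's distribution or redundancy.

```python
# Toy single-process sketch of MapReduce: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(document):
    # Mapper: emit (word, 1) for each word in the document.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework does.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data lakes hold data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["data"])  # 3
```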
› Hadoop 2.x (Generation 2)
Introducing YARN in the Data Management layer
Governance and integration at the enterprise level – data loading, data policy enforcement, data
management – introducing Apache Falcon
Security – authentication and authorization in a layered, secure way – Apache Knox
Operations – deploy, monitor, and manage the platform as a whole – introducing Apache Ambari
› Enterprise Hadoop
Deployment choice – physical, virtual, or cloud; Windows or Linux; Hortonworks, Cloudera, or
another distro product
Presentation and Applications – Enable existing and new applications to generate value from
Hadoop
Enterprise management and security – empower existing proven enterprise tools to integrate
with Hadoop
Services or product choice – YARN-enabling always-on, long-running services with Apache
Slider
15. “Other than” Hadoop, HDFS
› HDFS-like storage systems with similar
MapReduce engines
› MapR (uses its own file system, exposed via NFS)
Has cloud support too
› EMC, NetApp, Cleversafe, Symantec
› IBM’s BigInsights (a kind of distro of Cloudera, which
is in turn a distro of Hadoop)
› SAP’s HANA suite
› Of course, Google’s proprietary GFS, on which HDFS
was originally based
Editor's Notes
In November 1997, C.F. Jeff Wu gave the inaugural lecture entitled "Statistics = Data Science?"
In 2001, William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to incorporate "advances in computing with data" in his article "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics"
In April 2002, the International Council for Science: Committee on Data for Science and Technology (CODATA)[9] started the Data Science Journal
ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably