5. Large-Scale Data Management
Data Science and Analytics
Managing very large amounts of data and extracting value from it!
Data is the New Gold – Data Mining
6. Big Data - No single standard definition
“Big Data” is the data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and extract
value and hidden knowledge from it…
Examples: Google, Yahoo, Facebook, eBay, Amazon and many enterprises…
As per Wikipedia…
“Big data is an all-encompassing term for any collection of data sets so large
and complex that it becomes difficult to process using on-hand data
management tools or traditional data processing applications.”
What makes data, “Big” Data?
Data always existed; we made it Big.
7. Data generation
• Web data, e-commerce
• Purchases at department and grocery stores
• Bank/Credit Card transactions
• Social Networks
• Health care records
• Satellite imagery and weather modeling
• Many sources of Data
• Available in Public domain
9. Data Approximation
• Google processes 20 PB a day (2008)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
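The figures above use different units and time periods, so a rough normalisation helps compare them. The sketch below converts each growth rate to terabytes per day; the conversion factors (1 PB = 1000 TB, a 30-day month, a 365-day year) are assumptions, not values from the slide.

```python
# Normalise the growth figures quoted above to terabytes per day.
# Assumptions (not from the slides): 1 PB = 1000 TB, 30-day month, 365-day year.
TB_PER_PB = 1000

growth_tb_per_day = {
    "Google (processed, 2008)": 20 * TB_PER_PB,
    "Facebook (4/2009)": 15,
    "Wayback Machine (3/2009)": 100 / 30,
    "eBay (5/2009)": 50,
    "CERN LHC": 15 * TB_PER_PB / 365,
}

# Print the sources from fastest to slowest growth.
for source, rate in sorted(growth_tb_per_day.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{source:<28} {rate:>10.1f} TB/day")
```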
10. Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
▫ 44x increase from 2009 to 2020
▫ From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
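As a quick check of the "44x" figure, the sketch below recomputes the overall growth factor from the two volumes quoted on the slide and derives the compound annual growth rate they imply. Only the 0.8 ZB and 35 ZB values and the two years come from the slide; the steady-compounding model is an assumption.

```python
# Figures from the slide: 0.8 ZB in 2009, 35 ZB projected for 2020.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009

growth_factor = end_zb / start_zb               # overall multiple over the period
annual_rate = growth_factor ** (1 / years) - 1  # assumes steady compound growth

print(f"Overall growth: {growth_factor:.2f}x")
print(f"Implied annual growth: {annual_rate:.1%}")
```

The factor comes out just under 44, and the implied rate is roughly 41% per year.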
11. Characteristics of Big Data:
2-Complexity (Variety)
• Various formats, types, and
structures
• Text, numerical, images, audio,
video, sequences, time series,
social media data, multi-dim
arrays, etc…
• Static data vs. streaming data
• A single application can be
generating/collecting many types
of data
12. Characteristics of Big Data:
3-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions mean missed opportunities
17. Harnessing Big Data
• OLTP: Online Transaction Processing (RDBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Solutions)
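One way to picture the three workloads is the toy sketch below. It uses Python's built-in sqlite3 module and a hypothetical `sales` table: small committed writes stand in for OLTP, a single aggregate query over stored history stands in for OLAP, and a running aggregate updated per event stands in for RTAP. It illustrates the access patterns only, not any particular product.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sales table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

stream = [("north", 120.0), ("south", 80.0), ("north", 30.0)]

# OLTP style: many small transactions, each touching a few rows.
for region, amount in stream:
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", (region, amount))

# OLAP style: one read-heavy query that scans and aggregates stored history.
olap_totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))

# RTAP style: update the aggregate as each event arrives.
def running_totals(events):
    totals = defaultdict(float)
    for region, amount in events:
        totals[region] += amount
        yield dict(totals)  # an up-to-date answer after every event

rtap_totals = list(running_totals(stream))[-1]
print(olap_totals, rtap_totals)
```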
18. The Model Has Changed
• The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
20. Challenges in Handling Big Data
• The Bottleneck
▫ New architecture, algorithms, techniques, and solutions are needed
• Need for New talent with technical skills
▫ Experts in using the new technology and dealing with big data
21. How can we process all that information?
• There are actually two problems
– Large scale "data storage"
– Large scale "data analysis"
Data Processing Scalability
• We are generating more data than ever before
• Fortunately, the size and cost of storage have kept pace
Capacity has increased while price has decreased
Disk Capacity and Price
Year   Capacity (GB)   Cost per GB (USD)
1997   2.1             $157
2004   200             $1.05
2014   3,000           $0.036
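The table above can be turned into two rough ratios: how much capacity grew, and how much the cost per GB fell. The sketch below uses only the table's values; the per-drive price is simply capacity multiplied by cost per GB, which is an approximation rather than a quoted figure.

```python
# (year, capacity in GB, cost per GB in USD), copied from the table above.
disks = [(1997, 2.1, 157.0), (2004, 200, 1.05), (2014, 3000, 0.036)]

for year, capacity_gb, usd_per_gb in disks:
    drive_price = capacity_gb * usd_per_gb  # approximate price of one full drive
    print(f"{year}: {capacity_gb:>7,.1f} GB at ${usd_per_gb}/GB is about ${drive_price:,.0f} per drive")

capacity_growth = disks[-1][1] / disks[0][1]
price_drop = disks[0][2] / disks[-1][2]
print(f"Capacity grew about {capacity_growth:,.0f}x; cost per GB fell about {price_drop:,.0f}x")
```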
22. Disk Capacity and Performance
• Disk performance has also increased in the last 15 years
• Unfortunately, transfer rates have not kept pace with capacity
Year   Capacity (GB)   Transfer Rate (MB/s)   Disk Read Time
1997   2.1             16.6                   126 seconds
2004   200             56.5                   59 minutes
2014   3,000           210                    3 hours, 58 minutes
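The read times in the table follow from dividing capacity by transfer rate. The sketch below reproduces them, assuming 1 GB = 1000 MB and a purely sequential read of the whole disk; the function names are illustrative.

```python
def read_time_seconds(capacity_gb, transfer_mb_per_s):
    """Seconds needed to read the whole disk once, assuming 1 GB = 1000 MB."""
    return capacity_gb * 1000 / transfer_mb_per_s

def as_h_m_s(seconds):
    """Split a duration into whole hours, minutes and seconds."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, secs

# (year, capacity in GB, transfer rate in MB/s), copied from the table above.
disks = [(1997, 2.1, 16.6), (2004, 200, 56.5), (2014, 3000, 210)]
for year, capacity_gb, rate in disks:
    h, m, s = as_h_m_s(read_time_seconds(capacity_gb, rate))
    print(f"{year}: {h} h {m} min {s} s")
```

Capacity grew by a factor of more than 1,400 while the transfer rate grew by a factor of about 13, which is why the time to read a full disk went from minutes to hours.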
23. How does it work?
Big Data Technologies in use today!
The Competition and Complexity are
Immense…
28. Learning curve for Big Data?
• Big Data involves not just several tools, but numerous technologies,
methodologies, and mathematical and/or statistical concepts
• These need to be thought through, developed, refined, and applied
appropriately to reach a certain goal
• Algorithms and computing languages are required to practically turn
“Big Data into Applied Intelligence”
• If a flexible mind starts learning concepts like HADOOP and related
topics now, it can rightly be positioned in a project after a few months
29. Getting started
• Several perceptions and perspectives
▫ Several tools exist for Big-Data technology
▫ Students need the right direction to get started
• Learn the Concepts and The Platform
▫ How big data is managed in a scalable and efficient way
▫ Hadoop can be a good starting point
• Big-Data courses available in the market
▫ Administrator Courses for Apache Hadoop
▫ Developer Courses for Apache Hadoop
• Data Analyst courses
▫ Statistical tools for managing big data
▫ High-Level Languages: Apache Pig, Hive
▫ Programming Languages: Java, C, Python, R etc.
• Data Science courses
▫ Mahout: Data mining and machine learning tools over big data
▫ Scientific Data Modeling
30. Starting with HADOOP
• Prerequisites of HADOOP platform for learning
• Virtual machine environment
• Any supported or popular Linux distribution
• RHEL, SUSE, CentOS, Ubuntu or Fedora
• Oracle Java JDK
• HADOOP platform
• Single-node and then clustered with High-Availability
• Cloudera Quickstart VM (CDH 4 or 5)
• Cloudera is one of the pioneers in Big Data technologies
• CDH or Cloudera Distribution for HADOOP available as a VM
• Downloadable from Cloudera website
• Other needed software packages
31. A Brief Introduction to HADOOP
• The name is sometimes expanded as “High Availability Distributed Object Oriented Platform”, but Hadoop is not an official acronym
• Hadoop is the brainchild of Doug Cutting
• Hadoop was originally the name of his son’s toy elephant
• Based on Google’s published whitepapers
• Developed by The Apache Software Foundation (http://apache.org)
• Google started in the 1990s; the 2000s brought data management complexities
• In 2004, Google published a whitepaper on MapReduce, a framework that
provides a parallel processing model
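The parallel processing model can be sketched with a word count in plain Python. This is not the Hadoop API; the function names and the two input splits are hypothetical, and everything runs in one process. On a real cluster, each split would be mapped on a different machine and each key reduced independently.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

splits = ["big data is big", "data is the new gold"]  # hypothetical input splits
mapped = chain.from_iterable(map_phase(s) for s in splits)
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(word_counts)
```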
32. A Brief Introduction to HADOOP
• Google’s technologies namely
1. GFS (Google File System) – A distributed file system
2. MapReduce – A framework for parallel processing
3. BigTable – A Data storage system
• These were reimplemented as open-source projects under the
Apache Software Foundation, and are called:
1. HDFS (Hadoop Distributed File System)
2. MapReduce
3. Apache HBase
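To give a feel for what a distributed file system does, the sketch below splits a file into fixed-size blocks and stores copies of each block on several nodes. The 128 MB block size and three replicas are assumptions based on commonly cited HDFS defaults, the node names are hypothetical, and the round-robin placement is a simplification rather than HDFS's actual placement policy.

```python
import math

def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes (simplified round-robin placement)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(n_blocks):
        # Consecutive nodes (wrapping around) hold the replicas of this block.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement

layout = place_blocks(file_size_mb=1000, nodes=["node1", "node2", "node3", "node4"])
for block, holders in layout.items():
    print(f"block {block}: {holders}")
```

Because every block lives on more than one node, losing a single node does not lose data, and different blocks can be read from different disks at the same time.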
44. Some highlights of IT industry
• The term “IT” for Information technology first appeared in 1958
• IT has been a catalyst to areas of science and technology, and businesses
• A movement from IT driven industry to open information society
• We are today part of a global village, via the internet, which is now a commodity
or a common consumer service to many people
• Rapidly changing technologies
• Fast pace of Innovations and Inventions
• One needs to keep pace with new developments
• Enterprise examples - Blackberry, WhatsApp
45. What’s driving the Big Data Market
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
46. Value of Big Data Analytics
• Big data is more real-time in nature
than traditional DW applications
• Traditional DW architectures (e.g.
Exadata, Teradata) are not
well-suited for big data apps
• Shared nothing, massively parallel
processing, scale out architectures
are well-suited for big data apps
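A shared-nothing, scale-out design can be sketched as follows: each record is owned by exactly one node, every node works only on its own shard, and a final step merges the partial results. In this sketch threads stand in for separate machines, and the records and function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_nodes):
    """Assign each (key, value) record to exactly one node by hashing its key."""
    shards = [[] for _ in range(n_nodes)]
    for key, value in records:
        shards[zlib.crc32(key.encode()) % n_nodes].append((key, value))
    return shards

def local_total(shard):
    """Work a node does on its own shard, with no access to other shards."""
    return sum(value for _, value in shard)

records = [(f"customer-{i}", i) for i in range(1, 101)]  # hypothetical data

def cluster_total(n_nodes):
    shards = partition(records, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:  # threads stand in for machines
        return sum(pool.map(local_total, shards))

print(cluster_total(2), cluster_total(8))
```

Adding nodes changes how the data is split but not the answer, which is the property that lets such systems scale out.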
47. Real-world scenarios
• Imagine your boss comes to you and says:
“Here are 50 GB of logfiles. Find a way to improve
our business!”
• What would you do?
• Where would you start?
• And what would you do next?
Use Big Data to make better Business Decisions
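A possible first step for the logfile scenario is a single streaming pass that counts requests per page and per status code, so the data never has to fit in memory. The sketch below assumes web server logs in a Common Log Format style layout; the regular expression, sample lines and function name are assumptions, and real logfiles may look different.

```python
import re
from collections import Counter

# Assumed line layout (Common Log Format style); real log files may differ.
LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

def summarize(lines):
    """Stream over log lines once, keeping only small counters in memory."""
    by_path, by_status = Counter(), Counter()
    for line in lines:
        match = LINE.search(line)
        if match:  # skip lines that do not fit the assumed layout
            by_path[match["path"]] += 1
            by_status[match["status"]] += 1
    return by_path, by_status

sample = [
    '10.0.0.1 - - [01/Jan/2015:10:00:00 +0000] "GET /checkout HTTP/1.1" 500 312',
    '10.0.0.2 - - [01/Jan/2015:10:00:01 +0000] "GET /home HTTP/1.1" 200 1024',
    '10.0.0.3 - - [01/Jan/2015:10:00:02 +0000] "GET /checkout HTTP/1.1" 500 312',
]
by_path, by_status = summarize(sample)
print(by_path.most_common(1), by_status.most_common(1))
```

With real files, an open file handle can be passed to `summarize` in place of the sample list, since a file is iterated line by line. Pages that fail often, like `/checkout` in the sample, are an obvious place to start asking business questions.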
48. Some Examples of Big Data Projects
• Consumer product companies monitoring social media like Facebook and
Twitter to get an unprecedented view into customer behavior, preferences,
and product perception.
• Governments are making data public at both the national and
international level for users to develop new applications.
• Sports teams are using data for tracking ticket sales and even for tracking
team strategies.
• Manufacturers are monitoring minute vibration data from their equipment,
to predict the optimal time for component replacement.
• Financial Services organizations are using data mined from customer
interactions to create increasingly relevant and sophisticated offers.
• Advertising and marketing agencies are tracking social media to understand
responsiveness to campaigns, promotions, and other advertising mediums.
• Web-based businesses are developing information products that combine
data gathered from customers to offer more appealing recommendations
and more successful coupon programs.
49. Cloud and Big Data
Most traditional IT skills are shifting towards Cloud and Big
Data platforms.
Some related fields:
• Artificial Intelligence
• Distributed computing / super computing
• Business Analytics / Business Intelligence
• Data Analytics / Data Mining
• Projects running on legacy IT infrastructure
50. The Google Public Data Explorer
Makes large datasets easy to explore, visualize and communicate.