BIG DATA AND ANALYTICS CONCEPTS
BOHITESH MISRA
CHIEF TECHNOLOGY OFFICER
IT STARTUPS
BOHITESH.MISRA@GMAIL.COM
The Internet of Things connects
all manner of end-points, a
treasure trove of data
Networks and device
proliferation enable
access to a massive and
growing amount of
traditionally siloed
information
Analytics and business
intelligence tools empower
decision makers as never
before by extracting and
presenting meaningful
information in real-time,
helping us be more
predictive than reactive
BUILDING A CONNECTED AND SMART ECOSYSTEM:
A ROADMAP TO BUSINESS NIRVANA
IoT Big Data Analytics
GARTNER HYPE CYCLE - 2015
CONTENT
1. What is Big Data
2. Characteristics of Big Data
3. Why Big Data
4. How it is Different
5. Big Data Sources
6. Tools used in Big Data
7. Applications of Big Data
8. Risks of Big Data
9. Benefits of Big Data
10. How Big Data Impacts IT
11. Future of Big Data
BIG DATA
• Big Data may well be the Next Big Thing in the IT world.
• The first organizations to embrace big data were online
and startup firms. Firms like Google, eBay, LinkedIn, and
Facebook were built around big data from the beginning.
• Like many new information technologies, big data can bring
about dramatic cost reductions, substantial improvements in
the time required to perform a computing task, or new
product and service offerings.
• ‘Big Data’ is similar to ‘small data’, but bigger in size
• handling bigger data requires different approaches, techniques,
tools and architecture
• with an aim to solve new problems, or old problems in a better way
• Big Data generates value from the storage and processing of very
large quantities of digital information that cannot be analyzed with
traditional computing techniques.
WHAT IS BIG DATA?
THREE CHARACTERISTICS OF BIG DATA
Volume – Data quantity
Velocity – Data speed
Variety – Data types
BIG DATA - VOLUME
•A typical PC might have had 10 gigabytes of storage in 2000.
•Today, Facebook ingests 500 terabytes of new data every day.
•A Boeing 737 generates 240 terabytes of flight data during a
single flight across the US.
• Smartphones, and the data they create and consume, together with
sensors embedded in everyday objects, will soon result in billions of new,
constantly updated data feeds containing environmental, location,
and other information, including video.
BIG DATA - VELOCITY
• Clickstreams and ad impressions capture user behavior at millions of events
per second
• high-frequency stock trading algorithms reflect market changes within
microseconds
• machine to machine processes exchange data between billions of devices
• infrastructure and sensors generate massive log data in real-time
BIG DATA - VARIETY
• Big Data isn't just numbers, dates, and strings. Big Data is
also geospatial data, 3D data, audio and video, and
unstructured text, including log files and social media.
• Traditional database systems were designed to address
smaller volumes of structured data, fewer updates or a
predictable, consistent data structure.
• Big Data analysis includes different types of data
STORING BIG DATA
❖Analyzing your data characteristics
• Selecting data sources for analysis
• Eliminating redundant data
• Establishing the role of NoSQL
❖Overview of Big Data stores
• Data models: key value, graph, document, column-family
• Hadoop Distributed File System
• HBase
• Hive
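The data models listed above can be pictured with plain Python structures. This is an illustrative sketch (the store names and records are made up), not the API of any particular database:

```python
# Key-value: an opaque value looked up by key
kv_store = {"user:42": b'{"name": "Asha", "city": "Kolkata"}'}

# Document: nested, schema-free records addressed by id
doc_store = {
    "42": {"name": "Asha", "city": "Kolkata",
           "orders": [{"sku": "A1", "qty": 2}]},
}

# Column-family: rows grouped into families of columns;
# each row may carry a different set of columns
cf_store = {
    "row1": {"info": {"name": "Asha"}, "stats": {"visits": 7}},
    "row2": {"info": {"name": "Ravi", "email": "r@example.com"}},
}

def get_column(store, row, family, column):
    """Read one cell from the column-family store; None if absent."""
    return store.get(row, {}).get(family, {}).get(column)
```

Note how the column-family lookup tolerates rows that lack a whole family, which is exactly the schema flexibility these stores trade for relational guarantees.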
PROCESSING BIG DATA
❖Integrating disparate data stores
• Mapping data to the programming framework
• Connecting and extracting data from storage
• Transforming data for processing
• Subdividing data in preparation for Hadoop MapReduce
❖Employing Hadoop MapReduce
• Creating the components of Hadoop MapReduce jobs
• Distributing data processing across server farms
• Executing Hadoop MapReduce jobs
• Monitoring the progress of job flows
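The subdivide-and-distribute steps above can be sketched in miniature. This hypothetical example uses a thread pool on one machine to stand in for the server farm a real Hadoop cluster would use, and the `"key,value"` record format is an assumption for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def subdivide(records, num_splits):
    """Partition records round-robin into roughly equal input splits."""
    splits = [[] for _ in range(num_splits)]
    for i, rec in enumerate(records):
        splits[i % num_splits].append(rec)
    return splits

def map_task(split):
    """Map task: parse 'key,value' records into (key, int) pairs."""
    return [(k, int(v)) for k, v in (rec.split(",") for rec in split)]

def run_maps(records, num_splits=4):
    """Run one map task per split in parallel; collect all pairs."""
    splits = subdivide(records, num_splits)
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        return [pair for part in pool.map(map_task, splits) for pair in part]
```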
WHY BIG DATA
•FB generates 10TB daily
•Twitter generates 7TB of data Daily
•IBM claims 90% of today’s
stored data was generated
in just the last two years.
BIG DATA SOURCES
Users
Application
Systems
Sensors
Large and growing files
(Big data files)
DATA GENERATION POINTS - EXAMPLES
Mobile Devices
Readers/Scanners
Science facilities
Microphones
Cameras
Social Media
Programs/ Software
BIG DATA ANALYTICS
• Examining large amount of data
• Appropriate information
• Identification of hidden patterns, unknown correlations
• Competitive advantage
• Better business decisions: strategic and operational
• Effective marketing, customer satisfaction, increased revenue
• Where processing is hosted?
• Distributed Servers / Cloud
• Where data is stored?
• Distributed Storage
• What is the programming model?
• Distributed Processing (e.g. MapReduce)
• How data is stored & indexed?
• High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data?
• Analytic / Semantic Processing
TYPES OF TOOLS USED IN BIG-DATA
APPLICATIONS OF BIG DATA ANALYTICS
Homeland
Security
Smarter Healthcare
Integrated and smart
patient care systems
and processes
Retail & Multi-channel
sales
Highly personalized
customer experience
across channels and
devices
Telecom
Manufacturing
Intelligent
interconnectivity across
the enterprise for
enhanced control, speed
and efficiency
Traffic Control
Trading Analytics
Search Quality
Log Analysis
Finance & Banking
Seamless customer
experience across all
banking channels
HOW BIG DATA IMPACTS ON IT
• Big data is a disruptive force, presenting opportunities along with challenges to IT
organizations.
• By 2016 there will be 4.4 million IT jobs in Big Data; 1.9 million in the US alone
• India will require a minimum of 1 lakh (100,000) data scientists in the next couple of
years, in addition to data analysts and data managers, to support the Big Data
space.
POTENTIAL VALUE OF BIG DATA
• $300 billion potential annual
value to US health care.
• $600 billion potential annual
consumer surplus from using
personal location data.
• 60% potential increase in retailers’
operating margins.
BENEFITS OF BIG DATA
•Real-time big data isn’t just a process for storing petabytes
or exabytes of data in a data warehouse; it’s about the
ability to make better decisions and take meaningful
actions at the right time.
•Fast forward to the present and technologies like Hadoop
give you the scale and flexibility to store data before you
know how you are going to process it.
•Technologies such as MapReduce, Hive and Impala enable
you to run queries without changing the data structures
underneath.
BENEFITS OF BIG DATA
• Our newest research finds that organizations are using big data to target
customer-centric outcomes, tap into internal data and build a better information
ecosystem.
• Big Data is already an important part of the $64 billion database and data
analytics market
• It offers commercial opportunities comparable in scale to enterprise software in
the late 1980s, the Internet boom of the 1990s, and the social media explosion of today.
FUTURE OF BIG DATA
• $15 billion has been spent on software firms specializing only in data management
and analytics.
• This industry on its own is worth more than $100 billion and growing
at almost 10% a year which is roughly twice as fast as the software
business as a whole.
• The McKinsey Global Institute estimates that data volume is growing
40% per year, and will grow 44x between 2009 and 2020.
INDIA – BIG DATA
• Gaining traction and market share
• Huge market opportunities for IT services (82.9% of revenues) and
analytics firms (17.1%)
• Current market size is $200 million, expected to reach $1 billion by 2015
• The opportunity for Indian service providers lies in offering services
around Big Data implementation and analytics for global
multinationals
BIG DATA ANALYTICS TECHNOLOGIES
NoSQL : non-relational or at least non-SQL database solutions
such as HBase (also a part of the Hadoop ecosystem),
Cassandra, MongoDB, Riak, CouchDB, and many others.
Hadoop: It is an ecosystem of software packages, including
MapReduce, HDFS, and a whole host of other software
packages
THE FOUR PILLARS FOR AN EFFECTIVE BIG DATA STRATEGY
• Storage
• User Experience
• Digital Intelligence and Analytics
• Content Discovery and Management
Just these segments account for more than $10 billion in served, addressable markets.
MOTIVATION FOR SPECIALIZED BIG DATA SYSTEMS
• Cost of data storage is dropping, but rate of data capture is soaring
• Sources: online/digital, communications, messaging, usage, transactions…
• Furthermore, need for real-time data-driven insights is also more urgent
• Traditional data warehouses and RDBMS systems cannot keep up
• They are unable to capture, manage and optimize the volume and diversity of data
marketers are seeking to harness today
• Structured, unstructured, and semi-structured data are all essential ingredients in
today’s marketing mix; traditional systems cannot handle this
• Big Data systems: cluster-based, commodity priced, distributed
computing database management system
• Most often based on Hadoop, but usable without MapReduce programming skills
• Key features: linear scalability, parallel computing, node redundancy, and
centralized access to data
• Server clusters behave like a massive single mainframe: What traditional
databases do in months, a Big Data management system can do in hours
INTERNET OF THINGS
&
PREDICTIVE ANALYTICS
INTERNET OF THINGS
• Each “thing” or connected device is part of the digital shadow of a person
• For there to be a market in the internet of things, two things must be true:
1) The “thing” in question must provide utility to the human, and
2) The digital shadow must provide value to an enterprise.
MARKET
• The “market” is made up of many parts :
➢From wearable to drivable to home and
➢Industrial sensors and controllers, and
• Each part is made up of segments :
➢Innovators,
➢Early adopters,
➢Pragmatists,
➢Conservatives, and
➢Laggards across many industries.
PREDICTIVE ANALYTICS
• From the data streams that implement the “digital shadows” of people, we
can use predictive analytics to understand their needs and behavior better
than ever before.
• Every new dimension of data increases the predictive power, enabling
enterprises to answer the question “what does the human want?”
INTERNET OF THINGS & PREDICTIVE ANALYTICS
• Transforming the internet of things and its sibling, predictive analytics, to be
programmable by the same labor pool that has developed the apps which drove
the mobile revolution makes basic economic sense.
• The data generated by the internet of things is coupled with :
➢data analysis and
➢data discovery tools and techniques
to help business leaders identify emerging developments, such as machines that
might need maintenance (preventing costly breakdowns), or sudden shifts in customer or
market conditions that might signal some action a company should take.
• With the internet of things, the physical world will become a networked information system—
through sensors and actuators embedded in real physical objects and linked through
wired and wireless networks via the internet protocol.
• This holds special value for manufacturing:
➢The potential for connected physical systems to improve productivity in the production
process and the supply chain is huge.
• Consider processes that govern themselves, where smart products can take corrective
action to avoid damages and where individual parts are automatically replenished.
• Such technologies already exist and could drive the fourth industrial revolution—
following the steam engine, the conveyor belt (the assembly line - think Ford Model T), and the
first phase of IT and automation technology.
EXAMPLE 1 : AUTO INSURANCE
• The first-order vector was a connected accelerometer offered to drivers :
➢ to improve their insurance rates based on proven “safe driving” habits.
• Through this digital shadow, the insurance provider can make much better
actuarial predictions than through the coarse-grained data they had before
➢age,
➢gender, and
➢ traffic violations.
• This is interesting in the same way the BlackBerry was interesting - a basic
capability adopted for basic business improvement.
• The second-order vector is much stronger :
➢the ability to transform the insurance market to better meet the needs of customers while
changing the rules of competition.
➢based on real-time driving information insurance companies can :
▪ move to a real-time spot-pricing model driven by an exchange (not unlike the stock exchange),
▪ bidding on drivers and
▪ providing insurance on demand. Not driving today? Don’t pay for insurance. Need to drive fast
tomorrow? Pay a little more but don’t worry about your “permanent record”.
• These outcomes are all based on tying the internet of things to predictive
analytics.
EXAMPLE 2 : HEALTH CARE
• The first-order vector is similar, a wearable accelerometer offered to patients :
➢ To improve traceability of their compliance with their exercise prescription,
➢Enabling better outcomes for cardiac patients.
➢Unlike prescription refills, exercise compliance has been untraceable before, so this digital
shadow is a breakthrough for medicine.
• Similar developments exist in digestible sensors within medications :
➢which activate only on contact with stomach acid,
➢providing higher truth and
➢better granularity than a monthly refill.
• The second-order vector in healthcare is the ability to combine multiple streams of
information that were previously invisible, which has the potential to drive better health
outcomes through provably higher patient compliance.
• Sorting these data streams at scale will allow health providers and health insurance
companies to rapidly iterate health protocols across a population of humans, augmenting
human expertise with predictive analytics.
• Outcome-based analysis based on predictive models built from data can reduce :
➢waste,
➢error rates, and
➢lawsuits while driving better margins.
• Larger exchanges of this type of data will tend to :
➢ perform better,
➢creating a more effective market and
➢ a better pool of empirical research for science.
EXAMPLE 3 : AUTO COMPANIES
• They have installed thousands of "black boxes" inside their prototype and field
testing vehicles to capture second by second data from the dozens of control units
which manage today's automobiles.
• These boxes simply plug into the vehicle's on-board diagnostic (OBD) port, which is
typically located under the front dashboard of all cars.
• They collect 500-750 different vehicle performance parameters that add up to
terabytes of data in hours!
• The intent of the automakers for installing these boxes is to collect data which their
engineers can later analyze to fix bugs and improve on existing designs.
• For example, one car manufacturer discovered from this data why their minivan batteries
were heading for a recall.
➢The problem was an underpowered alternator - it was not able to fully recharge the batteries
because the most common drive cycle for this particular minivan was less than 3 miles.
➢As a result, there appeared to be a lot of complaints about dead batteries and the company was
potentially facing the recall of millions of minivans which had this alternator.
➢The boxes collect information about driving cycles and this data was really useful in understanding
the real reason behind the dead batteries.
➢The test vehicles which had short drive cycles were the ones which reported dead batteries!
Simply changing the alternator to a higher capacity could fix the problem.
➢It was then an easy fix to extend this solution to the entire fleet.
ENDLESS OPPORTUNITY
The opportunities are literally endless,
➢Ranging from early fault detection (predicting when a particular component
is likely to fail)
➢To automatically adjusting driving route based on traffic pattern
predictions.
The ultimate test of predictive analytics in the internet of things is of course fully
autonomous systems, such as :
➢the Nissan autonomous car planned for 2020 or
➢ the Google self-driving car of today.
In the end all autonomous systems will need the ability to build predictive
capabilities - in other words, machines must learn machine learning!
EXAMPLE 4 : GOOGLE’S SELF DRIVING CAR
Google claims that their self-driving car of today has logged more
than 300,000 miles with almost zero incidence of accidents.
The one time a minor crash did occur was when the car was rear-ended
by a human-driven car!
So, when the technology is fully mature, it is not just parking valets
who become obsolete; other higher-paying professions, such as
automotive safety systems experts, may also need to look for other
options!
Predictive analytics is the enabler that will make this happen.
EXAMPLE 5 : JET AIRLINER
• A jet airliner generates 20 terabytes of diagnostic data per hour of flight.
• The average oil platform has 40,000 sensors, generating data 24/7.
• M2M is now generating enormous volumes of data and is testing the capabilities of
traditional database technologies.
• To extract rich, real-time insight from the vast amounts of machine-generated data,
companies will have to build a technology foundation with speed and scale because raw
data, whatever the source, is only useful after it has been transformed into knowledge
through analysis.
• Investigative analytics tools enable interactive, ad-hoc querying on complex big data sets
to identify patterns and insights and can perform analysis at massive scale with precision
even as machine-generated data grows beyond the petabyte scale
FINDING RIGHT ANALYTICS DATABASE TECHNOLOGY
• To find the right analytics database technology to capture, connect, and drive
meaning from data, companies should consider the following requirements:
➢ Real-time Analysis : Businesses can’t afford for data to get stale. Data solutions need to :
▪ load quickly and easily,
▪ and must dynamically query,
▪ analyze, and
▪ communicate M2M information in real-time, without huge investments in IT administration, support, and tuning.
➢Flexible Querying And Ad-hoc Reporting : When intelligence needs to change quickly, analytic tools can’t
▪ be constrained by data schemas that limit the number and
▪ type of queries that can be performed.
This type of deeper analysis also cannot be constrained by tinkering or
time-consuming manual configuration (such as indexing and managing data partitions) to
create and change analytic queries.
➢Efficient Compression : Efficient data compression is key to enabling M2M data management within :
▪ A network node,
▪ Smart device, or
▪ Massive data center cluster.
Better compression allows :
▪ For less storage capacity overall,
▪ As well as tighter data sampling and
▪ Longer historical data sets,
▪ Increasing the accuracy of query results.
➢Ease Of Use And Cost : Data analysis must be :
▪ Affordable, Easy-to-use, and
▪ Simple to implement in order to justify the investment.
This demands low-touch solutions that are optimized to deliver :
▪ Fast analysis of large volumes of data,
▪ With minimal hardware, Administrative effort, and
▪ Customization needed to set up or
▪ Change query and reporting parameters.
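The "efficient compression" requirement above is easy to sanity-check: machine-generated data is highly repetitive, so a standard codec such as zlib shrinks it dramatically. The log-line format below is invented for illustration:

```python
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return original_size / compressed_size for the given data."""
    return len(data) / len(zlib.compress(data, level))

# A repetitive, machine-generated log (format is made up)
log = b"2015-06-01T12:00:00 sensor=17 temp=21.5 status=OK\n" * 10_000
ratio = compression_ratio(log)
```

Better compression directly buys the benefits listed above: less storage overall, and room for tighter sampling and longer historical data sets.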
EXAMPLE 6 : UNION PACIFIC RAILROAD
• The railroad is using sensor and analytics technologies to predict and prevent train derailments.
• For example, the company has placed infrared sensors on every 20 miles of its tracks to gather 20
million temperature readings of train wheels each day to look for signs of overheating, which is a
sign of impending failure.
• Meanwhile, trackside microphones are used to pick up “growling” bearings in the wheels.
• Data from such physical measurements are sent via fiber optic lines to union pacific’s data centers.
• Complex pattern-matching algorithms and analytics are used to identify irregularities, allowing
union pacific experts to determine within minutes of capturing the data whether a driver should
pull a train over for inspection or reduce its speed until it reaches the next station to be repaired.
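A toy version of the pattern-matching idea described above might look like this; the temperature threshold, reading format, and wheel ids are all assumptions for illustration, not Union Pacific's actual algorithm:

```python
def flag_overheating(readings, limit_c=90.0, min_hits=3):
    """Return wheel ids with at least `min_hits` readings above limit.

    `readings` is an iterable of (wheel_id, temperature_c) pairs,
    e.g. as parsed from trackside infrared sensors.
    """
    hits = {}
    for wheel_id, temp_c in readings:
        if temp_c > limit_c:
            hits[wheel_id] = hits.get(wheel_id, 0) + 1
    # Requiring several hot readings filters out one-off sensor noise
    return sorted(w for w, n in hits.items() if n >= min_hits)
```

Requiring repeated hot readings before flagging a wheel is the kind of irregularity detection that lets an analyst decide, within minutes, whether to pull a train over or merely slow it down.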
HOW TO ANALYZE MACHINE AND SENSOR DATA
• This example shows how to capture and refine data from heating, ventilation, and air
conditioning (HVAC) systems in 20 large buildings around the world using the Hortonworks
Data Platform, and how to analyze the refined sensor data to maintain optimal building
temperatures.
• Sensor data - A sensor is a device that measures a physical quantity and transforms it
into a digital signal. Sensors are always on, capturing data at a low cost, and powering
the “internet of things.”
• Potential uses of sensor data
➢Sensors can be used to collect data from many sources, such as:
➢To monitor machines or infrastructure such as ventilation equipment, bridges, energy meters, or
airplane engines. This data can be used for predictive analytics, to repair or replace these items
before they break.
➢To monitor natural phenomena such as meteorological patterns, underground pressure during oil
extraction, or patient vital statistics during recovery from a medical procedure.
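As a minimal sketch of the HVAC refinement step, the following (with an assumed `(building_id, temp_c)` reading format and an invented target band) averages readings per building and flags the buildings drifting out of range:

```python
def buildings_out_of_band(readings, target=21.0, tolerance=1.5):
    """readings: iterable of (building_id, temp_c) pairs.

    Returns the ids whose mean temperature falls outside
    target +/- tolerance, i.e. candidates for HVAC maintenance.
    """
    sums, counts = {}, {}
    for b, t in readings:
        sums[b] = sums.get(b, 0.0) + t
        counts[b] = counts.get(b, 0) + 1
    return sorted(b for b in sums
                  if abs(sums[b] / counts[b] - target) > tolerance)
```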
APACHE HADOOP - HDFS
OUTLINE
• Architecture of Hadoop Distributed File System
• Hadoop usage
• Ideas for Hadoop related research
HADOOP, WHY?
• Need to process Multi Petabyte Datasets
• Expensive to build reliability in each application.
• Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
• Need common infrastructure
– Efficient, reliable, Open Source Apache License
• The above goals are the same as Condor’s, but
• Workloads are IO-bound and not CPU-bound
HIVE, WHY?
• Need a Multi Petabyte Warehouse
• Files are insufficient data abstractions
• Need tables, schemas, partitions, indices
• SQL is highly popular
• Need for an open data format
– RDBMSs have a closed data format
– need a flexible schema
• Hive is a Hadoop subproject!
Hadoop
What is Hadoop?
 It's a framework for running applications that store and process huge
volumes of data on large clusters of commodity hardware
Hadoop Includes
 HDFS - a distributed filesystem
 Map/Reduce - Hadoop implements this programming model; it is an offline
computing engine
Concept
Moving computation is more efficient than moving large data
• Data intensive applications with Petabytes of data.
• Web pages - 20+ billion web pages x 20KB = 400+ terabytes
• One computer can read 30-35 MB/sec from disk, so it needs ~four months to read the web
• the same job with 1000 machines takes < 3 hours
• Difficulty with a large number of machines
• communication and coordination
• recovering from machine failure
• status reporting
• debugging
• optimization
• locality
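The back-of-the-envelope figures above can be checked directly (assuming 20 billion pages × 20 KB and decimal TB/MB units):

```python
def scan_time_seconds(total_bytes, machines, mb_per_sec=30.0):
    """Time to read `total_bytes` split evenly across `machines`,
    each streaming from disk at `mb_per_sec`."""
    return total_bytes / (machines * mb_per_sec * 1e6)

web_bytes = 20e9 * 20e3          # ~400 TB of web pages
one_machine_days = scan_time_seconds(web_bytes, 1) / 86_400
cluster_hours = scan_time_seconds(web_bytes, 1000) / 3_600
```

One machine comes out at roughly five months at 30 MB/s (about four at 35 MB/s), while 1000 machines finish in a few hours, which is the whole argument for the cluster.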
WHO USES HADOOP?
• Facebook
• Amazon/A9
• Google
• IBM
• New York Times
• Yahoo!
• PowerSet
COMMODITY HARDWARE
Typically in 2 level architecture
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
GOALS OF HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for Batch Processing
– Data locations exposed so that computations can move to where data
resides
– Provides very high aggregate bandwidth
• User Space, runs on heterogeneous OS
HDFS ARCHITECTURE
(Diagram: a Client talks to the NameNode and to the DataNodes; a Secondary
NameNode assists the NameNode; DataNodes report cluster membership to the NameNode)
NameNode : Maps a file to a file-id and list of DataNodes
DataNode : Maps a block-id to a physical location on disk
SecondaryNameNode: Periodic merge of Transaction log
DISTRIBUTED FILE SYSTEM
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
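The block-splitting described above reduces, in miniature, to slicing a byte stream at a fixed size; this sketch uses an in-memory "file" and a tiny block size for readability:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical HDFS block size

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Return the list of fixed-size blocks for one file;
    the last block may be shorter."""
    return [data[i:i + block_size]
            for i in range(0, len(data), block_size)]

# Toy example with a 10-byte block size for readability
blocks = split_into_blocks(b"abcdefghijklmnopqrstuvwxyz", block_size=10)
```

Each such block would then be replicated to multiple DataNodes, and the client fetches the blocks it needs directly from those nodes.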
HDFS – HADOOP DISTRIBUTED FILE SYSTEM
HADOOP CLUSTER ARCHITECTURE
• Map/Reduce Master “Jobtracker”
• Accepts MR jobs submitted by users
• Assigns Map and Reduce tasks to
Tasktrackers
• Monitors task and tasktracker status,
re-executes tasks upon failure
• Map/Reduce Slave “Tasktrackers”
• Run Map and Reduce tasks upon
instruction from the Jobtracker
• Manage storage and transmission of
intermediate output.
NAMENODE METADATA
• Meta-data in Memory
– The entire metadata is in main memory
– No demand paging of meta-data
• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
DATANODE
• A Block Server
– Stores data in the local file system
– Stores meta-data of a block
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
• Files are broken into large blocks.
– Typically 128 MB block size
– Blocks are replicated for reliability
• One replica on local node, another replica on a remote rack,
Third replica on local rack, Additional replicas are randomly placed
• Understands rack locality
– Data placement exposed so that computation can be migrated to data
• Client talks to both NameNode and DataNodes
– Data is not sent through the namenode, clients access data directly from
DataNode
– Throughput of file system scales nearly linearly with the number of nodes.
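The replica-placement rule stated above (first copy on the writer's node, second on a remote rack, third back on the local rack) can be sketched as follows; node and rack names are invented for illustration:

```python
import random

def place_replicas(local_node, local_rack, nodes_by_rack, seed=0):
    """nodes_by_rack: {rack: [node, ...]}. Returns 3 target nodes:
    local node, a node on a remote rack, a peer on the local rack."""
    rng = random.Random(seed)
    remote_racks = [r for r in nodes_by_rack if r != local_rack]
    remote = rng.choice(nodes_by_rack[rng.choice(remote_racks)])
    local_peers = [n for n in nodes_by_rack[local_rack] if n != local_node]
    third = rng.choice(local_peers)
    return [local_node, remote, third]
```

Keeping one replica off-rack survives a whole-rack failure, while the two on-rack copies keep most reads and the write pipeline cheap.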
DATA MODEL
DATA CORRECTNESS
• Use Checksums to validate data
– Use CRC32
• File Creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksum
• File access
– Client retrieves the data and checksum from DataNode
– If Validation fails, Client tries other replicas
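An illustrative version of this scheme, with a CRC32 per 512-byte chunk computed at creation and re-checked on read:

```python
import zlib

CHUNK = 512

def checksums(data: bytes):
    """One CRC32 per 512-byte chunk, as computed at file creation."""
    return [zlib.crc32(data[i:i + CHUNK])
            for i in range(0, len(data), CHUNK)]

def validate(data: bytes, expected):
    """Re-compute checksums on read; False means try another replica."""
    return checksums(data) == expected

stored = b"x" * 1500                 # 3 chunks: 512 + 512 + 476 bytes
sums = checksums(stored)
corrupted = stored[:100] + b"y" + stored[101:]   # flip one byte
```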
NAMENODE FAILURE
• A single point of failure – new version has a secondary
namenode
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
HADOOP MAP/REDUCE
• The Map-Reduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in generic framework
• Common design pattern in data processing
cat * | grep | sort | unique -c | cat > file
input | map | shuffle | reduce | output
• Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
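The `input | map | shuffle | reduce | output` pattern above maps directly onto a minimal single-process word count, the canonical MapReduce example (and the same computation as the grep/sort/uniq pipeline):

```python
from itertools import groupby

def map_phase(lines):
    """Emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    """Group values by key - the sort between map and reduce."""
    ordered = sorted(pairs)
    return [(k, [v for _, v in grp])
            for k, grp in groupby(ordered, key=lambda p: p[0])]

def reduce_phase(grouped):
    """Sum the grouped values per key."""
    return {k: sum(vs) for k, vs in grouped}

counts = reduce_phase(shuffle_phase(map_phase(["big data", "big deal"])))
# {'big': 2, 'data': 1, 'deal': 1}
```

In real Hadoop the framework runs many map and reduce tasks in parallel and performs the shuffle over the network; user code supplies only the map and reduce functions.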
EXAMPLE - HADOOP AT FACEBOOK
• Production cluster
• 4800 cores, 600 machines, 16GB per machine – April 2009
• 8000 cores, 1000 machines, 32 GB per machine – July 2009
• 4 SATA disks of 1 TB each per machine
• 2 level network hierarchy, 40 machines per rack
• Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
• Test cluster
• 800 cores, 16GB each
DATA FLOW
Web Servers → Scribe Servers → Network Storage → Hadoop Cluster → Oracle RAC / MySQL
HADOOP AND HIVE USAGE
• Statistics :
• 15 TB uncompressed data ingested per day
• 55TB of compressed data scanned per day
• 3200+ jobs on production cluster per day
• 80M compute minutes per day
• Barrier to entry is reduced:
• 80+ engineers have run jobs on Hadoop platform
• Analysts (non-engineers) starting to use Hadoop through Hive
BIG DATA LEARNING PATH
• 1. Understand the difference between various data handling techniques like OLTP,
OLAP, Data Mining, Data Warehouse, Data Mart, etc.
• 2. Understand various visualization techniques like Bar Chart, Heat Map, Tree Map,
Density Map, etc.
• 3. Understand Data Mining / Analytics algorithms.
• 4. Identify various sources of data and identify data elements that need to be
focused upon for Analytics.
• 6. Understand data quality checks and ensure data quality. Without quality data,
analytics may deviate a lot from the actual scenario.
BIG DATA LEARNING PATH
• 7. Answer why I need a distributed system.
• 8. Study various data handling techniques provided by NoSQL databases like
MongoDB, Cassandra, etc.
• 9. Find out how Hadoop or related Big Data techniques can be used for distributed
data by using horizontal scalability techniques.
• 10. Finalize algorithms that will run on top of this data and identify tools or develop
program for these algorithms.
• 11. Use visualization techniques learnt in step 2 above to present the output.
• 12. Keep on making the tool/program more and more intelligent as a continuous
process.
Note: A combination of Relational and NoSQL databases may be required for
performing required analytics and/or generating visualizations.
CONCLUSION
• Why commodity hardware ?
because it is cheaper
and designed to tolerate faults
• Why HDFS ?
network bandwidth vs seek latency
• Why Map reduce programming model?
parallel programming
large data sets
moving computation to data
single compute + data cluster
• Hadoop Log Analysis
• Failure prediction and root cause analysis
• Hadoop Data Rebalancing
• Based on access patterns and load
• Best use of flash memory?
• Design new topology based on commodity hardware
MORE IDEAS FOR FURTHER DISCUSSION AND
RESEARCH
USEFUL LINKS
•HDFS Design:
• http://hadoop.apache.org/core/docs/current/hdfs_design.html
•Hadoop API:
• http://hadoop.apache.org/core/docs/current/api/
•Hive:
• http://hadoop.apache.org/hive/
THANK YOU DATA SCIENTISTS !
Disruptive technologies - Session 1 - introductionBohitesh Misra, PMP
 
Business analytics why now_what next
Business analytics why now_what nextBusiness analytics why now_what next
Business analytics why now_what nextBohitesh Misra, PMP
 
Internet of Things (IoT) based Solar Energy System security considerations
Internet of Things (IoT) based Solar Energy System security considerationsInternet of Things (IoT) based Solar Energy System security considerations
Internet of Things (IoT) based Solar Energy System security considerationsBohitesh Misra, PMP
 

More from Bohitesh Misra, PMP (10)

Innovation in enterpreneurship_2021
Innovation in enterpreneurship_2021Innovation in enterpreneurship_2021
Innovation in enterpreneurship_2021
 
Use of data science for startups_Sept 2021
Use of data science for startups_Sept 2021Use of data science for startups_Sept 2021
Use of data science for startups_Sept 2021
 
Building castles on sand - Project Management in distributed project environment
Building castles on sand - Project Management in distributed project environmentBuilding castles on sand - Project Management in distributed project environment
Building castles on sand - Project Management in distributed project environment
 
Disruptive technologies - Session 4 - Biochip Digital twin Smart Fabrics
Disruptive technologies - Session 4 - Biochip Digital twin Smart FabricsDisruptive technologies - Session 4 - Biochip Digital twin Smart Fabrics
Disruptive technologies - Session 4 - Biochip Digital twin Smart Fabrics
 
Disruptive technologies - Session 3 - Green it_Smartdust
Disruptive technologies - Session 3 - Green it_SmartdustDisruptive technologies - Session 3 - Green it_Smartdust
Disruptive technologies - Session 3 - Green it_Smartdust
 
Disruptive technologies - Session 2 - Blockchain smart_contracts
Disruptive technologies - Session 2 - Blockchain smart_contractsDisruptive technologies - Session 2 - Blockchain smart_contracts
Disruptive technologies - Session 2 - Blockchain smart_contracts
 
Disruptive technologies - Session 1 - introduction
Disruptive technologies - Session 1 - introductionDisruptive technologies - Session 1 - introduction
Disruptive technologies - Session 1 - introduction
 
What is data science ?
What is data science ?What is data science ?
What is data science ?
 
Business analytics why now_what next
Business analytics why now_what nextBusiness analytics why now_what next
Business analytics why now_what next
 
Internet of Things (IoT) based Solar Energy System security considerations
Internet of Things (IoT) based Solar Energy System security considerationsInternet of Things (IoT) based Solar Energy System security considerations
Internet of Things (IoT) based Solar Energy System security considerations
 

Recently uploaded

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 

Recently uploaded (20)

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 

Big data and analytics

  • 1. BIG DATA AND ANALYTICS CONCEPTS BOHITESH MISRA CHIEF TECHNOLOGY OFFICER IT STARTUPS BOHITESH.MISRA@GMAIL.COM
  • 2. BUILDING A CONNECTED AND SMART ECOSYSTEM: A ROADMAP TO BUSINESS NIRVANA. IoT: The Internet of Things connects all manner of end-points, a treasure trove of data. Big Data: Networks and device proliferation enable access to a massive and growing amount of traditionally siloed information. Analytics: Analytics and business intelligence tools empower decision makers as never before by extracting and presenting meaningful information in real time, helping us be more predictive than reactive.
  • 4. CONTENT 1. What is Big Data 2. Characteristics of Big Data 3. Why Big Data 4. How it is Different 5. Big Data Sources 6. Tools Used in Big Data 7. Applications of Big Data 8. Risks of Big Data 9. Benefits of Big Data 10. How Big Data Impacts IT 11. Future of Big Data
  • 5. BIG DATA • Big Data may well be the Next Big Thing in the IT world. • The first organizations to embrace big data were online and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning. • Like many new information technologies, big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, or new product and service offerings.
  • 6. WHAT IS BIG DATA? • ‘Big Data’ is similar to ‘small data’, but bigger in size • being bigger, it requires different approaches, techniques, tools and architecture • the aim is to solve new problems, or old problems in a better way • Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
  • 8. THREE CHARACTERISTICS OF BIG DATA Volume • Data quantity Velocity • Data Speed Variety • Data Types
  • 9. BIG DATA - VOLUME • A typical PC might have had 10 gigabytes of storage in 2000. • Today, Facebook ingests 500 terabytes of new data every day. • A Boeing 737 generates 240 terabytes of flight data during a single flight across the US. • Smartphones, and the data they create and consume, together with sensors embedded into everyday objects, will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
  • 10. BIG DATA - VELOCITY • Clickstreams and ad impressions capture user behavior at millions of events per second • high-frequency stock trading algorithms reflect market changes within microseconds • machine to machine processes exchange data between billions of devices • infrastructure and sensors generate massive log data in real-time
  • 11. BIG DATA - VARIETY • Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media. • Traditional database systems were designed to address smaller volumes of structured data, fewer updates or a predictable, consistent data structure. • Big Data analysis includes different types of data
  • 12. STORING BIG DATA ❖Analyzing your data characteristics • Selecting data sources for analysis • Eliminating redundant data • Establishing the role of NoSQL ❖Overview of Big Data stores • Data models: key value, graph, document, column-family • Hadoop Distributed File System • HBase • Hive
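The data models named on this slide (key-value, document, column-family) can be made concrete with a small sketch. This is plain Python for illustration only, not the API of HBase, Hive, or any specific store; the record fields are invented.

```python
# Sketch: the same customer record expressed in three of the
# non-relational data models named on the slide.

# Key-value: an opaque value looked up by a single key
kv_store = {"customer:42": '{"name": "Asha", "city": "Kolkata"}'}

# Document: the value is a structured, schema-free document
doc_store = {"customer:42": {"name": "Asha", "city": "Kolkata",
                             "orders": [{"id": 1, "total": 250.0}]}}

# Column-family: each row groups related columns into families
cf_store = {
    "customer:42": {
        "profile": {"name": "Asha", "city": "Kolkata"},  # family "profile"
        "orders":  {"order:1": 250.0},                   # family "orders"
    }
}

print(cf_store["customer:42"]["orders"]["order:1"])  # 250.0
```

The trade-off the slide hints at: the key-value form is fastest to look up but opaque to query, while the document and column-family forms let the store reason about individual fields.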
  • 13. PROCESSING BIG DATA ❖Integrating disparate data stores • Mapping data to the programming framework • Connecting and extracting data from storage • Transforming data for processing • Subdividing data in preparation for Hadoop MapReduce ❖Employing Hadoop MapReduce • Creating the components of Hadoop MapReduce jobs • Distributing data processing across server farms • Executing Hadoop MapReduce jobs • Monitoring the progress of job flows
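The map, shuffle/sort, and reduce phases listed on this slide can be sketched as a toy word count in plain Python. This is an illustrative model of the MapReduce flow, not Hadoop's actual Java API; in Hadoop the shuffle and the distribution across server farms happen transparently.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input split
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum all counts shuffled to this key
    return word, sum(counts)

def map_reduce(lines):
    # Shuffle/sort phase: group intermediate pairs by key, as Hadoop does
    pairs = sorted(p for line in lines for p in mapper(line))
    return dict(reducer(word, (c for _, c in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

counts = map_reduce(["big data big insight", "big value"])
print(counts["big"])  # 3
```

Because each mapper sees only its own split and each reducer only its own key, the same program scales from one machine to thousands without changes, which is the point of the framework.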
  • 14. WHY BIG DATA • Facebook generates 10 TB of data daily • Twitter generates 7 TB of data daily • IBM claims 90% of today’s stored data was generated in just the last two years.
  • 15. BIG DATA SOURCES Users Application Systems Sensors Large and growing files (Big data files)
  • 16. DATA GENERATION POINTS - EXAMPLES Mobile Devices Readers/Scanners Science facilities Microphones Cameras Social Media Programs/ Software
  • 17. BIG DATA ANALYTICS • Examining large amount of data • Appropriate information • Identification of hidden patterns, unknown correlations • Competitive advantage • Better business decisions: strategic and operational • Effective marketing, customer satisfaction, increased revenue
  • 18. • Where processing is hosted? • Distributed Servers / Cloud • Where data is stored? • Distributed Storage • What is the programming model? • Distributed Processing (e.g. MapReduce) • How data is stored & indexed? • High-performance schema-free databases (e.g. MongoDB) • What operations are performed on data? • Analytic / Semantic Processing TYPES OF TOOLS USED IN BIG-DATA
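The "high-performance schema-free databases" point can be illustrated with a minimal document store in plain Python. This is a toy, not MongoDB's API: a real system adds indexing, sharding, and persistence, but the schema-free idea — documents of different shapes living in one collection, matched by a query dict — is the same.

```python
# Minimal sketch of schema-free, document-oriented storage.
collection = []  # no schema is declared up front

def insert(doc):
    collection.append(doc)

def find(query):
    # Return documents whose fields match every key in the query dict
    return [d for d in collection
            if all(d.get(k) == v for k, v in query.items())]

insert({"user": "a", "clicks": 10, "country": "IN"})
insert({"user": "b", "clicks": 3})          # a different shape is fine
insert({"user": "c", "clicks": 7, "country": "IN"})

print(len(find({"country": "IN"})))  # 2
```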
  • 19. APPLICATIONS OF BIG DATA ANALYTICS Homeland Security; Smarter Healthcare: integrated and smart patient care systems and processes; Retail & Multi-channel Sales: highly personalized customer experience across channels and devices; Telecom; Manufacturing: intelligent interconnectivity across the enterprise for enhanced control, speed and efficiency; Traffic Control; Trading Analytics; Search Quality; Log Analysis; Finance & Banking: seamless customer experience across all banking channels
  • 20. HOW BIG DATA IMPACTS IT • Big data is a disruptive force, presenting opportunities along with challenges to IT organizations. • By 2016 there will be 4.4 million IT jobs in Big Data, 1.9 million of them in the US alone. • India will require a minimum of 1 lakh (100,000) data scientists in the next couple of years, in addition to data analysts and data managers, to support the Big Data space.
  • 21. POTENTIAL VALUE OF BIG DATA • $300 billion potential annual value to US health care. • $600 billion potential annual consumer surplus from using personal location data. • 60% potential increase in retailers’ operating margins.
  • 22. BENEFITS OF BIG DATA • Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse; it’s about the ability to make better decisions and take meaningful actions at the right time. • Technologies like Hadoop now give you the scale and flexibility to store data before you know how you are going to process it. • Technologies such as MapReduce, Hive and Impala enable you to run queries without changing the data structures underneath.
  • 23. BENEFITS OF BIG DATA • Our newest research finds that organizations are using big data to target customer-centric outcomes, tap into internal data and build a better information ecosystem. • Big Data is already an important part of the $64 billion database and data analytics market • It offers commercial opportunities of a comparable scale to enterprise software in the late 1980s • And the Internet boom of the 1990s, and the social media explosion of today.
  • 24. FUTURE OF BIG DATA • $15 billion has been spent on software firms specializing in data management and analytics. • This industry on its own is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole. • The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020.
  • 25. INDIA – BIG DATA • Gaining traction and market share • Huge market opportunities for IT services (82.9% of revenues) and analytics firms (17.1%) • Current market size is $200 million, expected to reach $1 billion by 2015 • The opportunity for Indian service providers lies in offering services around Big Data implementation and analytics for global multinationals
  • 26. BIG DATA ANALYTICS TECHNOLOGIES NoSQL : non-relational or at least non-SQL database solutions such as HBase (also a part of the Hadoop ecosystem), Cassandra, MongoDB, Riak, CouchDB, and many others. Hadoop: It is an ecosystem of software packages, including MapReduce, HDFS, and a whole host of other software packages
  • 27. THE FOUR PILLARS FOR AN EFFECTIVE BIG DATA STRATEGY Storage User Experience Digital intelligence and Analytics Content Discovery and Management Just these segments account for more than $10 billion in served, addressable markets.
  • 28. MOTIVATION FOR SPECIALIZED BIG DATA SYSTEMS • Cost of data storage is dropping, but rate of data capture is soaring • Sources: online/digital, communications, messaging, usage, transactions… • Furthermore, need for real-time data-driven insights is also more urgent • Traditional data warehouses and RDBMS systems cannot keep up • They are unable to capture, manage and optimize the volume and diversity of data marketers are seeking to harness today • Structured, unstructured, and semi-structured data are all essential ingredients in today’s marketing mix; traditional systems cannot handle this • Big Data systems: cluster-based, commodity priced, distributed computing database management system • Most often based on Hadoop, but usable without MapReduce programming skills • Key features: linear scalability, parallel computing, node redundancy, and centralized access to data • Server clusters behave like a massive single mainframe: What traditional databases do in months, a Big Data management system can do in hours
  • 30. INTERNET OF THINGS • Each “thing” or connected device is part of the digital shadow of a person • For there to be a market in the internet of things, two things must be true: 1) The “thing” in question must provide utility to the human, and 2) The digital shadow must provide value to an enterprise.
  • 31. MARKET • The “market” is made up of many parts : ➢From wearable to drivable to home and ➢Industrial sensors and controllers, and • Each part is made up of segments : ➢Innovators, ➢Early adopters, ➢Pragmatists, ➢Conservatives, and ➢Laggards across many industries.
  • 32. PREDICTIVE ANALYTICS • From the data streams that implement the “digital shadows” of people, we can use predictive analytics to understand their needs and behavior better than ever before. • Every new dimension of data increases the predictive power, enabling enterprises to answer the question “what does the human want?”
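The slide's claim that data streams enable prediction can be reduced to its simplest form: fit a trend to a stream and extrapolate one step ahead. The sketch below uses ordinary least squares in plain Python; the "daily usage" stream is invented, and real predictive analytics would use far richer models and features.

```python
def fit_line(ys):
    # Ordinary least squares for y = a*x + b with x = 0, 1, 2, ...
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    a = num / den
    return a, mean_y - a * mean_x

daily_usage = [10.0, 12.0, 14.0, 16.0]      # a user's activity stream
a, b = fit_line(daily_usage)
predicted_next = a * len(daily_usage) + b   # forecast for the next day
print(predicted_next)  # 18.0
```

Every additional dimension of data (location, time of day, device) becomes another feature the model can condition on, which is what the slide means by "every new dimension of data increases the predictive power".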
  • 33. INTERNET OF THINGS & PREDICTIVE ANALYTICS • Transforming the internet of things and its sibling, predictive analytics, to be programmable by the same labor pool that developed the apps which drove the mobile revolution makes basic economic sense. • The data generated by the internet of things is coupled with: ➢ data analysis ➢ data discovery tools and ➢ techniques to help business leaders identify emerging developments, such as machines that might need maintenance (to prevent costly breakdowns) or sudden shifts in customer or market conditions that might signal some action a company should take.
  • 34. • With the internet of things, the physical world will become a networked information system, through sensors and actuators embedded in real physical objects and linked through wired and wireless networks via the internet protocol. • This holds special value for manufacturing: ➢ The potential for connected physical systems to improve productivity in the production process and the supply chain is huge. • Consider processes that govern themselves, where smart products can take corrective action to avoid damage and where individual parts are automatically replenished. • Such technologies already exist and could drive the fourth industrial revolution, following the steam engine, the conveyor belt (the assembly line, think Ford Model T), and the first phase of IT and automation technology.
  • 35. EXAMPLE 1 : AUTO INSURANCE • The first-order vector was a connected accelerometer offered to drivers: ➢ to improve their insurance rates based on proven “safe driving” habits. • Through this digital shadow, the insurance provider can make much better actuarial predictions than through the coarse-grained data they had before: ➢ age, ➢ gender, and ➢ traffic violations. • This is interesting in the same way the BlackBerry was interesting: a basic capability adopted for basic business improvement.
  • 36. • The second-order vector is much stronger : ➢the ability to transform the insurance market to better meet the needs of customers while changing the rules of competition. ➢based on real-time driving information insurance companies can : ▪ move to a real-time spot-pricing model driven by an exchange (not unlike the stock exchange), ▪ bidding on drivers and ▪ providing insurance on demand. Not driving today? Don’t pay for insurance. Need to drive fast tomorrow? Pay a little more but don’t worry about your “permanent record”. • These outcomes are all based on tying the internet of things to predictive analytics.
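The spot-pricing idea on this slide ("Not driving today? Don't pay.") can be sketched as a per-trip premium computed from telematics data. The rate constant and risk multipliers below are invented purely for illustration; a real exchange would set them from actuarial models and live bidding.

```python
# Illustrative usage-based insurance pricing from trip telemetry.
BASE_RATE_PER_KM = 0.05   # currency units per km driven (assumed)

def trip_premium(km_driven, max_speed_kmh, night_trip):
    risk = 1.0
    if max_speed_kmh > 120:   # observed fast driving raises the price
        risk *= 1.5
    if night_trip:            # night driving is statistically riskier
        risk *= 1.2
    return round(km_driven * BASE_RATE_PER_KM * risk, 2)

print(trip_premium(0, 0, False))     # not driving today: 0.0
print(trip_premium(40, 130, False))  # 40 km with fast driving: 3.0
```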
  • 37. EXAMPLE 2 : HEALTH CARE • The first-order vector is similar, a wearable accelerometer offered to patients : ➢ To improve traceability of their compliance with their exercise prescription, ➢Enabling better outcomes for cardiac patients. ➢Unlike prescription refills, exercise compliance has been untraceable before, so this digital shadow is a breakthrough for medicine. • Similar developments exist in digestible sensors within medications : ➢which activate only on contact with stomach acid, ➢providing higher truth and ➢better granularity than a monthly refill.
  • 38. • The second-order vector in healthcare, the ability to combine multiple streams of information that were previously invisible, has the potential to drive better health outcomes through provably higher patient compliance. • Sorting these data streams at scale will allow health providers and health insurance companies to rapidly iterate health protocols across a population of humans, augmenting human expertise with predictive analytics. • Outcome-based analysis based on predictive models built from data can reduce: ➢ waste, ➢ error rates, and ➢ lawsuits, while driving better margins. • Larger exchanges of this type of data will tend to perform better, creating: ➢ a more effective market and ➢ a better pool of empirical research for science.
  • 39. EXAMPLE 3 : AUTO COMPANIES • They have installed thousands of "black boxes" inside their prototype and field-testing vehicles to capture second-by-second data from the dozens of control units which manage today's automobiles. • These boxes simply plug into the vehicle's on-board diagnostics (OBD) port, which is typically located under the front dashboard of all cars. • They collect 500-750 different vehicle performance parameters that add up to terabytes of data in hours!
  • 40. • The intent of the automakers for installing these boxes is to collect data which their engineers can later analyze to fix bugs and improve on existing designs. • For example, one car manufacturer found out from this data that their minivan batteries would end up in a recall. ➢The problem was an underpowered alternator - it was not able to fully recharge the batteries because the most common drive cycle for this particular minivan was less than 3 miles. ➢As a result, there appeared to be a lot of complaints about dead batteries and the company was potentially facing the recall of millions of minivans which had this alternator. ➢The boxes collect information about driving cycles and this data was really useful in understanding the real reason behind the dead batteries. ➢The test vehicles which had short drive cycles were the ones which reported dead batteries! simply changing the alternator to higher capacity could fix the problem. ➢Now it was an easy fix to extend this solution to the entire fleet.
  • 41. ENDLESS OPPORTUNITY The opportunities are literally endless, ➢Ranging from early fault detection (predicting when a particular component is likely to fail) ➢To automatically adjusting driving route based on traffic pattern predictions. The ultimate test of predictive analytics in the internet of things is of course fully autonomous systems, such as : ➢the nissan car of 2020 or ➢ the google self driving car of today. In the end all autonomous systems will need the ability to build predictive capabilities - in other words, machines must learn machine learning!
  • 42. EXAMPLE 4 : GOOGLE’S SELF-DRIVING CAR Google claims that their self-driving car of today has logged more than 300,000 miles with almost zero incidence of accidents. The one time a minor crash did occur was when the car was rear-ended by a human-driven car! So, when the technology is fully mature, it is not just parking valets who become obsolete; other higher-paying professions such as automotive safety systems experts may also need to look for other options! Predictive analytics is the enabler that will make this happen.
  • 43. EXAMPLE 5 : JET AIRLINER • A jet airliner generates 20 terabytes of diagnostic data per hour of flight. • The average oil platform has 40,000 sensors, generating data 24/7. • M2M is now generating enormous volumes of data and is testing the capabilities of traditional database technologies. • To extract rich, real-time insight from the vast amounts of machine-generated data, companies will have to build a technology foundation with speed and scale because raw data, whatever the source, is only useful after it has been transformed into knowledge through analysis. • Investigative analytics tools enable interactive, ad-hoc querying on complex big data sets to identify patterns and insights and can perform analysis at massive scale with precision even as machine-generated data grows beyond the petabyte scale
  • 44. FINDING THE RIGHT ANALYTICS DATABASE TECHNOLOGY • To find the right analytics database technology to capture, connect, and drive meaning from data, companies should consider the following requirements: ➢Real-time Analysis: Businesses can’t afford for data to get stale. Data solutions need to: ▪ load quickly and easily, ▪ dynamically query and analyze, and ▪ communicate M2M information in real-time, without huge investments in IT administration, support, and tuning. ➢Flexible Querying And Ad-hoc Reporting: When intelligence needs to change quickly, analytic tools can’t be constrained by data schemas that limit the number and type of queries that can be performed. This type of deeper analysis also cannot be constrained by tinkering or time-consuming manual configuration (such as indexing and managing data partitions) to create and change analytic queries.
  • 45. ➢Efficient Compression: Efficient data compression is key to enabling M2M data management within: ▪ a network node, ▪ a smart device, or ▪ a massive data center cluster. Better compression allows: ▪ less storage capacity overall, ▪ tighter data sampling, and ▪ longer historical data sets, increasing the accuracy of query results. ➢Ease Of Use And Cost: Data analysis must be: ▪ affordable, ▪ easy to use, and ▪ simple to implement in order to justify the investment. This demands low-touch solutions optimized to deliver fast analysis of large volumes of data, with minimal hardware, administrative effort, and customization needed to set up or change query and reporting parameters.
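The compression point is easy to demonstrate: machine-generated readings are highly repetitive, so even a general-purpose compressor shrinks them dramatically. A small illustration using Python's standard zlib (the data and sensor name are invented):

```python
import json
import zlib

# Illustrative only: 1000 near-identical sensor readings, serialized as JSON.
readings = [{"sensor": "wheel-42", "temp_c": 35 + (i % 3)} for i in range(1000)]
raw = json.dumps(readings).encode("utf-8")

# General-purpose compression already collapses the repetition; purpose-built
# columnar encodings in analytics databases typically do even better.
packed = zlib.compress(raw, level=9)
print(f"{len(raw)} bytes -> {len(packed)} bytes "
      f"({len(raw) / len(packed):.0f}x smaller)")
```

Less storage per reading is what makes tighter sampling and longer retention affordable, as the slide argues.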
  • 46. EXAMPLE 6 : UNION PACIFIC RAILROAD • The railroad is using sensor and analytics technologies to predict and prevent train derailments. • For example, the company has placed infrared sensors every 20 miles along its tracks to gather 20 million temperature readings of train wheels each day, looking for overheating, which is a sign of impending failure. • Meanwhile, trackside microphones are used to pick up “growling” bearings in the wheels. • Data from such physical measurements is sent via fiber optic lines to Union Pacific’s data centers. • Complex pattern-matching algorithms and analytics are used to identify irregularities, allowing Union Pacific experts to determine within minutes of capturing the data whether a driver should pull a train over for inspection or reduce its speed until it reaches the next station for repair.
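The decision described above - pull over, slow down, or continue - is at its core a thresholded classification of each wheel-temperature reading. A minimal sketch; the function name and temperature thresholds are invented for illustration, not Union Pacific's actual values:

```python
# Hypothetical sketch of the per-reading decision from the Union Pacific
# example. Thresholds are illustrative, not the railroad's real values.
def wheel_action(temp_c, slow_at=90.0, stop_at=120.0):
    """Classify one infrared wheel-temperature reading."""
    if temp_c >= stop_at:
        return "stop for inspection"
    if temp_c >= slow_at:
        return "reduce speed until next station"
    return "continue"

print(wheel_action(65.0))   # continue
print(wheel_action(95.0))   # reduce speed until next station
print(wheel_action(130.0))  # stop for inspection
```

The hard part in production is not this rule but applying it to 20 million readings a day within minutes of capture.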
  • 47. HOW TO ANALYZE MACHINE AND SENSOR DATA • This example shows how to capture and refine data from heating, ventilation, and air conditioning (HVAC) systems in 20 large buildings around the world using the Hortonworks Data Platform, and how to analyze the refined sensor data to maintain optimal building temperatures. • Sensor data - A sensor is a device that measures a physical quantity and transforms it into a digital signal. Sensors are always on, capturing data at low cost and powering the “Internet of Things.” • Potential uses of sensor data ➢Sensors can be used to collect data from many sources, for example: ➢To monitor machines or infrastructure such as ventilation equipment, bridges, energy meters, or airplane engines. This data can be used for predictive analytics, to repair or replace these items before they break. ➢To monitor natural phenomena such as meteorological patterns, underground pressure during oil extraction, or patient vital signs during recovery from a medical procedure.
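One refinement step for the HVAC example might be computing, per building, how far the measured temperature drifts from the target - the buildings with the largest average deviation are the ones whose HVAC needs attention. A hedged sketch with invented sample data (in practice this aggregation would run as a Hive query or MapReduce job over the refined sensor files):

```python
from collections import defaultdict

# Hypothetical refined HVAC data: (building, measured temperature in C).
TARGET_C = 21.0
readings = [("HQ", 20.5), ("HQ", 23.0), ("Lab", 18.0), ("Lab", 19.0)]

# Average absolute deviation from the target, per building.
deviations = defaultdict(list)
for building, temp in readings:
    deviations[building].append(abs(temp - TARGET_C))

avg_dev = {b: sum(d) / len(d) for b, d in deviations.items()}
print(avg_dev)  # {'HQ': 1.25, 'Lab': 2.5} -> Lab drifts furthest from target
```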
  • 49. OUTLINE • Architecture of the Hadoop Distributed File System • Hadoop usage • Ideas for Hadoop-related research
  • 50. HADOOP, WHY? • Need to process multi-petabyte datasets • Expensive to build reliability into each application • Nodes fail every day – Failure is expected, rather than exceptional – The number of nodes in a cluster is not constant • Need common infrastructure – Efficient, reliable, Open Source (Apache License) • The above goals are the same as Condor’s, but • Workloads are IO bound, not CPU bound
  • 51. HIVE, WHY? • Need a multi-petabyte warehouse • Files are insufficient data abstractions • Need tables, schemas, partitions, indices • SQL is highly popular • Need an open data format – RDBMS have a closed data format – need a flexible schema • Hive is a Hadoop subproject!
  • 52. Hadoop What is Hadoop? It’s a framework for running applications on large clusters of commodity hardware, which both stores huge volumes of data and processes it. Hadoop includes: • HDFS, a distributed filesystem • Map/Reduce, the programming model Hadoop implements; it is an offline (batch) computing engine. Concept: Moving computation is more efficient than moving large data.
  • 53. • Data-intensive applications with petabytes of data. • Web pages - 20+ billion web pages x 20KB = 400+ terabytes • One computer can read 30-35 MB/sec from disk: ~four months to read the web • The same job on 1000 machines: < 3 hours • Difficulties with a large number of machines: • communication and coordination • recovering from machine failure • status reporting • debugging • optimization • locality
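The slide's back-of-envelope figures check out roughly; a quick arithmetic sanity check (using the upper end of the quoted 30-35 MB/sec disk rate):

```python
# Sanity-checking the slide's numbers: 20+ billion pages x 20 KB each.
pages = 20e9
page_bytes = 20e3
total_bytes = pages * page_bytes      # 4e14 bytes = 400 TB, as stated
disk_mb_s = 35                        # upper end of 30-35 MB/sec

# One machine reading sequentially, then 1000 machines in parallel.
one_machine_days = total_bytes / (disk_mb_s * 1e6) / 86400
cluster_hours = one_machine_days * 24 / 1000

print(round(one_machine_days))    # ~132 days, on the order of four months
print(round(cluster_hours, 1))    # ~3.2 hours with 1000 machines
```

This is exactly the economic argument for Hadoop: the data is disk-bound, so the only way to read it in useful time is massive parallelism over cheap machines.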
  • 54. WHO USES HADOOP? • Facebook • Amazon/A9 • Google • IBM • New York Times • Yahoo! • PowerSet
  • 55. COMMODITY HARDWARE Typically a 2-level architecture – Nodes are commodity PCs – 30-40 nodes/rack – Uplink from rack is 3-4 gigabit – Rack-internal is 1 gigabit
  • 56. GOALS OF HDFS • Very Large Distributed File System – 10K nodes, 100 million files, 10 PB • Assumes Commodity Hardware – Files are replicated to handle hardware failure – Detects failures and recovers from them • Optimized for Batch Processing – Data locations exposed so that computations can move to where the data resides – Provides very high aggregate bandwidth • Runs in user space, on heterogeneous OSes
  • 57. HDFS ARCHITECTURE (diagram: Client, NameNode, Secondary NameNode, DataNodes, cluster membership) NameNode: Maps a file to a file-id and a list of blocks on DataNodes DataNode: Maps a block-id to a physical location on disk SecondaryNameNode: Periodic merge of the transaction log
  • 58. DISTRIBUTED FILE SYSTEM • Single Namespace for entire cluster • Data Coherency – Write-once-read-many access model – Client can only append to existing files • Files are broken up into blocks – Typically 128 MB block size – Each block replicated on multiple DataNodes • Intelligent Client – Client can find location of blocks – Client accesses data directly from DataNode
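The block model above is easy to quantify: a file is split into fixed-size blocks (the last one may be partially filled), and each block is stored several times. A small sketch of the footprint arithmetic, using the slide's 128 MB block size and the common replication factor of 3:

```python
import math

BLOCK_MB = 128      # block size quoted on the slide
REPLICATION = 3     # typical replica count per block

def hdfs_blocks(file_mb):
    """Blocks a file is split into; the last block may be partially filled."""
    return math.ceil(file_mb / BLOCK_MB)

# A 1 GB file occupies 8 blocks, i.e. 24 block replicas cluster-wide.
print(hdfs_blocks(1024), hdfs_blocks(1024) * REPLICATION)  # 8 24
```

Large blocks keep the NameNode's in-memory metadata small and make sequential disk reads dominate seek time.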
  • 60. HDFS – HADOOP DISTRIBUTED FILE SYSTEM
  • 61. HADOOP CLUSTER ARCHITECTURE • Map/Reduce Master “JobTracker” • Accepts MR jobs submitted by users • Assigns Map and Reduce tasks to TaskTrackers • Monitors task and TaskTracker status, re-executes tasks upon failure • Map/Reduce Slaves “TaskTrackers” • Run Map and Reduce tasks upon instruction from the JobTracker • Manage storage and transmission of intermediate output
  • 62. NAMENODE METADATA • Meta-data in Memory – The entire metadata is in main memory – No demand paging of meta-data • Types of Metadata – List of files – List of blocks for each file – List of DataNodes for each block – File attributes, e.g. creation time, replication factor • A Transaction Log – Records file creations, file deletions, etc.
  • 63. DATANODE • A Block Server – Stores data in the local file system – Stores meta-data of a block – Serves data and meta-data to Clients • Block Report – Periodically sends a report of all existing blocks to the NameNode • Facilitates Pipelining of Data – Forwards data to other specified DataNodes
  • 64. DATA MODEL • Files are broken into large blocks – Typically 128 MB block size – Blocks are replicated for reliability • One replica on the local node, another replica on a remote rack, a third replica on the local rack; additional replicas are randomly placed • Understands rack locality – Data placement is exposed so that computation can be migrated to the data • Client talks to both NameNode and DataNodes – Data is not sent through the NameNode; clients access data directly from DataNodes – Throughput of the file system scales nearly linearly with the number of nodes
  • 65. DATA CORRECTNESS • Use checksums to validate data – CRC32 • File Creation – Client computes a checksum per 512 bytes – DataNode stores the checksums • File Access – Client retrieves the data and checksums from the DataNode – If validation fails, the client tries other replicas
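The per-chunk checksum scheme can be sketched in a few lines. This uses Python's standard zlib CRC32 with the slide's 512-byte granularity; the function names are invented for illustration:

```python
import zlib

CHUNK = 512  # the checksum granularity from the slide

def chunk_checksums(data: bytes):
    """CRC32 per 512-byte chunk, mirroring what the client computes on write."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, checksums):
    """Reader side: recompute and compare; a mismatch means try another replica."""
    return chunk_checksums(data) == checksums

payload = b"x" * 1300                     # three chunks: 512 + 512 + 276 bytes
sums = chunk_checksums(payload)
print(verify(payload, sums))              # True: data intact
print(verify(payload[:-1] + b"y", sums))  # False: corruption detected
```

Checksumming small chunks, rather than whole 128 MB blocks, pinpoints which part of a block is corrupt and keeps re-reads cheap.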
  • 66. NAMENODE FAILURE • A single point of failure – newer versions add a Secondary NameNode (checkpointing, not failover) • Transaction Log stored in multiple directories – A directory on the local file system – A directory on a remote file system (NFS/CIFS) • Need to develop a real HA solution
  • 67. HADOOP MAP/REDUCE • The Map-Reduce programming model – Framework for distributed processing of large data sets – Pluggable user code runs in a generic framework • Common design pattern in data processing: cat * | grep | sort | uniq -c | cat > file input | map | shuffle | reduce | output • Natural for: – Log processing – Web search indexing – Ad-hoc queries
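The map | shuffle | reduce shape of that pipeline can be shown with the classic word-count example, written as three single-machine Python functions (a sketch of the programming model, not of Hadoop's actual distributed implementation):

```python
from itertools import groupby

# Word count in the map | shuffle | reduce shape of the slide's pipeline.
def map_phase(lines):
    """map: emit (word, 1) for every word - the `grep` stage's analogue."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """shuffle: sort and group by key - the `sort` stage's analogue."""
    return groupby(sorted(pairs), key=lambda kv: kv[0])

def reduce_phase(grouped):
    """reduce: sum counts per key - the `uniq -c` stage's analogue."""
    return {word: sum(count for _, count in kvs) for word, kvs in grouped}

logs = ["error disk full", "error network", "ok"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts)  # 'error' appears twice, every other word once
```

In Hadoop the same three stages run across thousands of machines, with the framework handling partitioning, retries, and moving computation to the data.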
  • 68. EXAMPLE - HADOOP AT FACEBOOK • Production cluster • 4800 cores, 600 machines, 16GB per machine – April 2009 • 8000 cores, 1000 machines, 32 GB per machine – July 2009 • 4 SATA disks of 1 TB each per machine • 2 level network hierarchy, 40 machines per rack • Total cluster size is 2 PB, projected to be 12 PB in Q3 2009 • Test cluster • 800 cores, 16GB each
  • 69. DATA FLOW (diagram) Web Servers → Scribe Servers → Network Storage → Hadoop Cluster → Oracle RAC / MySQL
  • 70. HADOOP AND HIVE USAGE • Statistics : • 15 TB uncompressed data ingested per day • 55TB of compressed data scanned per day • 3200+ jobs on production cluster per day • 80M compute minutes per day • Barrier to entry is reduced: • 80+ engineers have run jobs on Hadoop platform • Analysts (non-engineers) starting to use Hadoop through Hive
  • 71. BIG DATA LEARNING PATH • 1. Understand the difference between various data handling techniques like OLTP, OLAP, Data Mining, Data Warehouse, Data Mart, etc. • 2. Understand various visualization techniques like Bar Chart, Heat Map, Tree Map, Density Map, etc. • 3. Understand Data Mining / Analytics algorithms. • 4. Identify various sources of data and identify the data elements that need to be focused upon for analytics. • 6. Understand data quality checks and ensure data quality. Without quality data, analytics may deviate a lot from the actual scenario.
  • 72. BIG DATA LEARNING PATH • 7. Answer why you need a distributed system. • 8. Study various data handling techniques provided by NoSQL databases like MongoDB, Cassandra, etc. • 9. Find out how Hadoop or related Big Data techniques can be used for distributed data using horizontal scalability techniques. • 10. Finalize the algorithms that will run on top of this data and identify tools or develop programs for these algorithms. • 11. Use the visualization techniques learnt in step 2 above to present the output. • 12. Keep making the tool/program more and more intelligent as a continuous process. Note: A combination of relational and NoSQL databases may be required for performing the required analytics and/or generating visualizations.
  • 73. CONCLUSION • Why commodity hardware? Because it is cheaper, and the system is designed to tolerate faults • Why HDFS? Network bandwidth vs seek latency • Why the Map/Reduce programming model? Parallel programming over large data sets; moving computation to the data; a single compute + data cluster
  • 74. MORE IDEAS FOR FURTHER DISCUSSION AND RESEARCH • Hadoop log analysis – Failure prediction and root cause analysis • Hadoop data rebalancing – Based on access patterns and load • Best use of flash memory? • Design of new topologies based on commodity hardware
  • 75. USEFUL LINKS •HDFS Design: • http://hadoop.apache.org/core/docs/current/hdfs_design.html •Hadoop API: • http://hadoop.apache.org/core/docs/current/api/ •Hive: • http://hadoop.apache.org/hive/
  • 76. THANK YOU DATA SCIENTISTS !