The document discusses how big data is driving the need for new database technologies that can handle large, unstructured datasets and provide real-time analytics capabilities that traditional relational databases cannot support. It outlines the limitations of relational databases for big data, analyzes emerging technologies such as Hadoop, NoSQL databases, and cloud computing that are better suited to storing, processing, and analyzing large volumes of diverse data types, and examines the infrastructure, architectural, and market requirements for big data platforms and products.
2. Abstract IMEX RESEARCH.COM

NextGen Infrastructure for Big Data

This session will appeal to Business Planning, Marketing, Technology System Integrators and Data Center Managers seeking to understand the drivers behind the demand for and rise of Big Data.

Abstract
• The internet has spawned an explosion in data growth in the form of data sets, called Big Data, that are so large they are difficult to store, manage and analyze using traditional RDBMS, which are tuned for Online Transaction Processing (OLTP) only. Not only is this new data heavily unstructured, voluminous, rapidly streaming and difficult to harness, but, even more importantly, the infrastructure cost of the hardware and software required to crunch it using traditional RDBMS to derive any analytics or business intelligence online (OLAP) is prohibitive.
• To capitalize on the Big Data trend, many companies have emerged with a new breed of Big Data technologies (such as Hadoop and others) that leverage parallelized processing, commodity hardware, open source software and tools to capture and analyze these new data sets, providing price/performance that is 10 times better than existing Database/Data Warehousing/Business Intelligence systems.

Learning Objectives
• The presentation illustrates the operational challenges businesses face today using RDBMS systems despite fast-access in-memory and solid-state storage technologies. It details how IT is harnessing the emergent Big Data to manage massive amounts of data, and new techniques such as parallelization and virtualization to solve complex problems, in order to empower businesses with knowledgeable decision-making.
• It lays out the rapidly evolving Big Data technology ecosystem: different Big Data technologies from Hadoop, distributed file systems and emerging NoSQL derivatives for implementation in private and hybrid cloud-based environments, to the storage infrastructure requirements to store, access, secure and prepare data for analytics and visualization while manipulating it rapidly to derive business intelligence online and run businesses smartly.
4. Harnessing Big Data for Business Insights

Data Sources > Big Data Infrastructure > Business Insights
• Data Sources: Information is at the center of a new wave of opportunity.
• Big Data Infrastructure: The majority of data growth is being driven by unstructured data and billions of large objects.
• Business Insights: 80% of the world's data is unstructured, driven by the rise in mobility devices, collaboration and machine-generated data.
5. Corporate Need: Business Performance Optimization

Unstructured Big Data can provide next-gen analytics to help businesses make informed, better decisions in:
▪ Product Strategy
▪ Targeting Sales
▪ Just-In-Time Supply-Chain Economics
▪ Business Performance Optimization
▪ Predictive Analytics & Recommendations
▪ Country Resources Management
7. Corporate Need: Business Insights

Store
• Issue: Information exploding. Volume: digital content doubling every 18 months. Velocity: >80% of growth driven by unstructured data. Variety: sources of data changing.
• Solution: A unified information/content storage methodology that enables users to manage the volume, velocity and variety of information from multiple sources.

Manage
• Issue: Complexity in "managing" information. Need to classify, synchronize, aggregate, integrate, share, transform, profile, move, cleanse, protect, retire.
• Solution: A solution portfolio of tools and services to manage all types of information in a hybrid storage environment.

Analyze
• Issue: Current solutions limited to BI tools focused on structured and lagging information.
• Solution: Build/buy packaged real-time predictive analytical solutions for unstructured analytics.

Collaborate
• Issue: Multiple access methods needed to meet the needs of a diverse audience.
• Solution: Centralized way to share, collaborate and act on insights anytime, anywhere, on any device.

Model/Adapt
• Issue: Ability to understand how the information impacts the business, and how to translate it into action.
• Solution: Model information on current operations with potential strategy impact. Leverage technology to adapt.
8. Opportunity: Converting Big Data Deluge into Predictive Analytics & Insights

Personal Location Services: Data Generated, TB/Year
• Navigation Devices: 600
• Navigation Apps on Phones: 20
• Smart Phone Opt-In Tracking: 1,000
• Geo-Targeted Ads: 20
• People Locator (Emergency Calls/Search): 10
• Location-Based Services (e.g. Games): 5
• Other: 45
• Total (Est.): 1,700
9. Issues with Existing RDBMS

Key Issues with RDBMS Technologies
• Unstructured data: RDBMSs don't handle non-tabular data (notorious for doing a poor job on recursive data structures).
• Archaic architecture: RDBMSs don't parallelize well to accommodate commodity hardware clusters.
• Speed: Seek time of physical storage has not kept pace with network speed improvements.
• Scale: Difficult to scale out RDBMSs efficiently; clustering beyond a few servers is notoriously hard.
• Integration: Data processing tasks need to combine data from non-related sources, over a network.
• Volume: Data volumes have grown from tens of GB to hundreds of TB to PBs in recent years. Existing tabular RDBMSs can't handle such large databases.
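The scale-out limitation above is why newer systems partition (shard) data by key across many commodity nodes instead of clustering one monolithic RDBMS. A minimal sketch of hash-based sharding, with all class and key names hypothetical:

```python
import hashlib

class ShardedStore:
    """Toy hash-partitioned key-value store: each key lives on one of N nodes."""
    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Stable hash, so every client routes the same key to the same node.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

store = ShardedStore(num_nodes=4)
for i in range(1000):
    store.put(f"user:{i}", {"id": i})
print(store.get("user:42"))           # {'id': 42}
print([len(n) for n in store.nodes])  # roughly even split across the 4 nodes
```

Capacity grows by adding nodes rather than by buying a bigger server; the hard parts a real system must add are rebalancing when nodes join or leave, and replication for fault tolerance.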
10. Issues with Existing RDBMS

Present RDBMSs are struggling to store & analyze Big Data.
11. Big Data - Database Solutions
12. Big Data – The New Face of DBs

Big Data Paradigm - the new face of DB systems
• Adopts a schema-free architecture
• Can do away with legacy relational DB systems; some data have sparse attributes and do not need the relational property
• Key-oriented queries: some data are stored/retrieved mainly by primary key, without complex joins
• Trades off Consistency, Availability & Partition Tolerance
• Scales out, not up: online load-balancing cluster growth
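The schema-free, key-oriented access pattern above can be illustrated with a toy document store (class and field names hypothetical): each record carries only the sparse attributes it needs, and everything is fetched by primary key with no declared schema and no joins.

```python
class DocumentStore:
    """Toy schema-free store: documents are written and read by primary key only."""
    def __init__(self):
        self.docs = {}

    def put(self, key, doc):
        # No schema check: any document shape is accepted.
        self.docs[key] = doc

    def get(self, key):
        return self.docs.get(key)

db = DocumentStore()
# Sparse attributes: the two documents have different fields, which is fine.
db.put("user:1", {"name": "Ada", "email": "ada@example.com"})
db.put("user:2", {"name": "Alan", "last_login": "2013-01-29"})
print(db.get("user:2")["name"])  # Alan
```

Contrast with an RDBMS, where both rows would have to fit one declared table schema and cross-entity lookups would go through joins.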
16. Innovations – DB SW Technologies

Tech innovation timeline, 1985-2015:
• OLTP DB SW (Transactions): Row Locking, Optimizer, Parallel Query, Clustering, XML, Open Source/Grid, Hadoop
• OLAP/Analytics DB SW: Indexing, Materialized View, Bit-Mapped Index, Partitioning, Columnar, In-Memory, Query Binding
• Hardware: 32-bit, SMP, NUMA, 64-bit, Multi-core/Blades, Flash, MPP
• Big Data: Columnar, MPP, Multi-core, In-Memory, Visualization

Charts (axis data not recoverable from the slide): "OLTP Database Innovation Progress" ($/TPMc and TPMc/Processor, 1985-2015) and "Big Data: Analytics DB Technology Impact" ($/GB and DW Size TB, 1985-2015).
17. Big Data - Key Requirements

Types of data organizations analyze (Hadoop users / Non-Hadoop users):
• Customer/member data: 59% / 68%
• Transactional data from applications: 44% / 68%
• Application logs: 69% / 37%
• Other types of event data: 64% / 23%
• Network monitoring/network traffic: 41% / 33%
• Online retail transactions: 33% / 34%
• Other log files: 51% / 26%
• Call data records: 28% / 32%
• Web logs: 46% / 21%
• Text data from social media and online: 36% / 15%
• Search logs: 36% / 11%
• Trade/quote data: 18% / 15%
• Intelligence/defense data: 18% / 11%
• Multimedia (audio/video/images): 21% / 9%
• Weather: 8% / 3%
• Smart-meter data: 3% / 6%
• Other (please specify): 3% / 5%
18. Big Data – Architectural Goals

A Big Data platform must:
• Meet the requirements of the 3 Vs (volume, velocity, variety)
• Meet enterprise criteria
• Analyze data in its native format
19. Big Data - Market Requirements

Unified system: pre-integrated for ease of installation and management
• Platform: large-scale indexing, pre-integrated using the Hadoop foundation
• Integrated text analytics: address unstructured data
• Usability: user-friendly admin console, including HDFS explorer and query languages
• Enterprise-class features: provisioning, storage, scheduler, advanced security
• Supports a search-centric, document-based XML data model
  o Store documents within a transactional repository
• Schema-free:
  o No advance knowledge of the document structure (its "schema") needed
  o Index words and values from each of the loaded documents together with its document structure
• Leverages standard commodity hardware
20. Big Data – Market Requirements

Architectural
• Shared-nothing clustered DB architecture
  o Programmable and extensible application servers
• Support massive scalability to petabytes of source data
• Support an open-source XQuery- and XSLT-driven architecture
• Simple to deploy, develop and manage (UI & RESTful interface)
• Support extreme mixed workloads: a wide variety of data types including arbitrarily hierarchical data structures, images, waveforms, data logs, etc.
• Support thousands of geographically dispersed online users and programs executing a variety of requests, from ad hoc queries to strategic analysis
• Load data before declaring or discovering its structure
• Load data in batch and streaming fashion
• Integrate data from multiple sources during the load process at very high rates; spread I/O and data across instances
• Provide consistent performance with linear cost
• Leverage open-source software for low cost, multiple sources, Hadoop foundation tools
• Connectivity with Oracle DB, Teradata warehouse, JDBC
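"Loading data before declaring or discovering its structure" is the schema-on-read idea: raw records are ingested as-is, and structure is inferred only when a query needs it. A minimal sketch, with the record contents and field names purely illustrative:

```python
import json

raw_lines = [
    '{"ts": 1, "user": "ada", "bytes": 512}',
    '{"ts": 2, "ip": "10.0.0.7"}',  # a different shape: fine at load time
]

# Load first, without declaring any schema.
records = [json.loads(line) for line in raw_lines]

# Discover the structure only when a query needs it.
observed_fields = sorted({field for rec in records for field in rec})
print(observed_fields)  # ['bytes', 'ip', 'ts', 'user']
```

An RDBMS would reject the second record at load time for not matching the declared table; a schema-on-read system accepts both and lets each query decide which fields matter.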
21. Big Data - Market Requirements

Real-Time Analytics Execution
• Execute "streaming" analytic queries in real time on incoming load data
• Update data in place at full load speeds
• Schedule and execute complex multi-hundred-node workflows
• Join a billion-row dimension table to a trillion-row fact table without pre-clustering the dimension table with the fact table

Performance
• Analyze data at very high rates, >GB/sec
• Predictable sub-ms response time for highly constrained standard SQL queries

Availability
• Ability to configure without any single point of failure
• Auto-failover, extreme high availability
  o Automated failover and process continuation without operational interruption when processing nodes fail
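A "streaming" analytic query on incoming load data can be sketched as a running aggregate that updates in place as each record arrives, rather than after a batch load finishes (function and event names hypothetical):

```python
from collections import defaultdict

def streaming_sum_by_key(stream):
    """Update a running aggregate as each record arrives; no batch reload."""
    totals = defaultdict(int)
    for key, value in stream:
        totals[key] += value
        yield key, totals[key]  # the aggregate is queryable mid-stream

events = [("api", 3), ("web", 1), ("api", 2)]
for key, running_total in streaming_sum_by_key(events):
    print(key, running_total)
# api 3
# web 1
# api 5
```

The same idea scales out in real systems by partitioning the stream by key across nodes, so each node maintains its slice of the running totals.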
22. Big Data – Product Metrics Choices

Big Data product metrics (the dimensions along which products differ):
• Data Set Size: GB, TB, PB
• Data Structure: transaction, machine, unstructured, other
• Access/Use: transaction, search, analytics
• Processing: appliance, parallel cluster <1K nodes, cluster >1K nodes
• Memory: in-memory, flash
• DB Technique: columnar, zero sharing, NoSQL
• Data Cataloging SW: text, image, audio, video
23. Advantage: Big Data Products

Characteristic: Legacy Paradigm vs. Big Data Paradigm
• Structure: transactional/corporate vs. unstructured/derivative/internet
• Mode: data collection vs. data analysis
• Focus: find answers vs. find questions
• Facility: reportive ("What happened?") vs. analytic ("Why did it happen?") and predictive ("What will happen next?")
• Opportunity: very small growth vs. massive growth
• Players: legacy players vs. agile, well-funded start-ups
• Impact: analyze existing businesses vs. create new businesses
24. Advantage: Big Data Products

Characteristic: Traditional RDBMS vs. Big Data/MapReduce
• Data size: GB vs. PB
• Access: interactive vs. batch/near real-time
• Latency: low vs. high
• Data updates: read & write many times vs. write once, read many times
• Schema/structure: static schema vs. dynamic schema
• Language: SQL vs. UQL/procedural (Java, C++, ...)
• Integrity: high vs. not 100%
• Works well for: process-intensive jobs vs. data-intensive jobs
• Works well with data size: gigabytes vs. petabytes
• Data/processing interactions: low latency/high bandwidth is a precursor to success, and network bandwidth can be a bottleneck causing nodes to be idle vs. sends code to data instead of sending data to other nodes (requiring lower bandwidth in the cluster)
• Fault tolerance: coordinating processes across node failures is a challenge vs. fault tolerant for HW/SW failures
• Scaling: non-linear vs. linear
• Distribution of jobs: difficult vs. simple & effective
26. Big Data Stack

Merging Hadoop innovations into next-gen DBMS. The stack (figure) flows from Data Sources through the Analytics Platform to Applications:
• Data Sources: databases (Oracle, IBM, Teradata, ...), streaming devices
• Ingest: data importer/ETL (e.g. Oracle-to-Hive), streaming data collector
• Data Store: Hadoop, with a data exporter to downstream systems
• Analytics Platform: RHive advanced analytics, Hive ad-hoc query, enterprise workflow
• Applications: DBA tools (PerfMon, query plan), OLAP server, OLTP server, REST/JSON search API, real-time queries, admin center
28. Big Data - Hadoop Architecture

Hadoop layers, top to bottom:
• Programming languages: Data Flow (Pig) and SQL (Hive)
• Coordination (ZooKeeper) and Management (HMS)
• Distributed computing framework: computations (MapReduce)
• Tabular storage: metadata (HCatalog) and column storage (HBase)
• Object storage: Hadoop Distributed File System (HDFS)
29. Big Data Connectors to EDW/BI

BI framework, interoperable with enterprise data warehousing (top to bottom):
• EDW BI applications (query, analytics, reporting, statistics)
• Big Data connectors
• Big Data orchestration framework: HBase, Avro, Flume, ZooKeeper
• Big Data access framework: Hive, Sqoop, Pig
• Big Data storage framework (HDFS) and Big Data processing framework (MapReduce)
• Hypervisor/VMs, operating system, servers
Cross-cutting concerns: backup & recovery, management, security, network.
30. Big Data Infrastructure – MapReduce

(The slide repeats the Hadoop layer diagram: Pig, Hive, MapReduce, ZooKeeper, HMS, HCatalog, HBase, HDFS.)

MapReduce: a distributed computing model
• Typical pipeline: Input > Map > Shuffle/Sort > Reduce > Output
• Easy to use: the developer writes only a few functions
• Moves compute to data: schedules work on the HDFS node holding the data
• Scans through data, reducing seeks
• Automatic reliability and re-execution on failure
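The Input > Map > Shuffle/Sort > Reduce > Output pipeline above can be sketched in a few lines: the developer supplies only the map and reduce functions, and the framework handles the shuffle. This is a single-process illustration of the model, not a distributed implementation; the classic word-count example is used.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: emit (word, 1) for every word in this input split.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: combine all values that were shuffled to the same key.
    return word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle/Sort: group intermediate pairs by key, as the framework would
    # across the network in a real cluster.
    pairs = sorted(kv for line in inputs for kv in map_fn(line))
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

lines = ["big data beyond hadoop", "big data"]
print(mapreduce(lines, map_fn, reduce_fn))
# [('beyond', 1), ('big', 2), ('data', 2), ('hadoop', 1)]
```

Because map and reduce are side-effect-free over their inputs, the framework can re-execute a failed task on another node, which is where the automatic reliability in the list above comes from.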
31. Big Data Infrastructure – HDFS
[Diagram: HDFS architecture, actively maintaining high availability. Two Name Nodes each hold the hierarchical namespace (file name > block IDs > block locations) and block state/maps, persisting namespace metadata and the journal to NFS. Data Nodes store replicated blocks (b1, b2, b3) on JBOD disks, send periodic heartbeats and block reports, and periodically check block checksums; a bad or lost block is re-replicated by copying from a surviving node. I/O and storage scale horizontally.]
HDFS
• Immutable file system – read, write, sync/flush; no random writes
• Storage server used for computation – moves computation to the data
• Fault tolerant & easy to manage – built-in redundancy tolerates disk and node failure, auto-manages addition/removal of nodes, one operator per ~8K nodes
• Not a SAN, but high-bandwidth network access to data via Ethernet
• Typically used to solve problems not feasible with traditional systems: large storage capacity (>100 PB raw), large I/O and computational bandwidth (>4K nodes/cluster), scaling by adding commodity hardware, at a cost of ~$1.5/GB including the MapReduce cluster
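The heartbeat and block-report bookkeeping sketched above can be illustrated with a toy re-replication planner (the data structures and node names are invented for the sketch; this is not Hadoop code):

```python
REPLICATION_FACTOR = 3

# block_id -> set of data nodes currently reporting the block;
# the name node builds this map from periodic block reports
block_map = {
    "b1": {"dn1", "dn2", "dn3"},
    "b2": {"dn1", "dn4"},        # under-replicated
    "b3": {"dn2"},               # badly under-replicated
}

def replication_work(block_map, all_nodes):
    """Return (block, source, targets) copy tasks restoring the factor."""
    tasks = []
    for block, holders in block_map.items():
        missing = REPLICATION_FACTOR - len(holders)
        if missing > 0:
            # copy from any live holder to nodes that lack the block
            targets = sorted(all_nodes - holders)[:missing]
            tasks.append((block, sorted(holders)[0], targets))
    return tasks

nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}
for block, src, dst in replication_work(block_map, nodes):
    print(f"re-replicate {block} from {src} to {dst}")
```

This is the essence of "built-in redundancy": losing a disk or node only shrinks a block's holder set, and the system schedules copies until the replication factor is restored.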
32. Hadoop Distributed File System
HDFS Characteristics
• Based on GFS (Google File System)
• Redundant storage for massive amounts of data
• Data is distributed across all nodes at load time – enables efficient MapReduce processing
• Runs on commodity hardware – assumes a high failure rate for components
• Works well with lots of large files
• Built around "write once, read many times"
• Large streaming reads – not random access
• High throughput is more important than low latency
34. Key Technologies Required for Big Data
• Cloud Infrastructure
• Virtualization
• Networking
• Storage
o In-memory database (solid-state memory)
o Tiered storage software (performance enhancement)
o Deduplication (cost reduction)
o Data protection (backup, archive & recovery)
35. Cloud Infrastructure for Big Data
Cloud service layers and providers, with examples:
• Cloud service providers
− Public – multitenancy, on demand: BT, Telstra, T-Systems, France Telecom
− Private – on premises, enterprise
− Hybrid – interoperable P2P: IBM/CloudBurst
• SaaS – Software-as-a-Service
− eMail: Yahoo!, Google; collaboration: Facebook, Twitter; business apps: SalesForce, GoogleApps, Intuit
• PaaS – platform tools & services to deploy developed platforms ready for application software on cloud-aware infrastructure
− Examples: Amazon EC2, Force.com, Navitaire
• IaaS – infrastructure hardware & services: servers, network, storage; management, reporting
− Examples: Amazon S3, Nirvanix
36. Cloud Infrastructure for Big Data
Cloud Computing
[Diagram: the cloud stack spanning private cloud (enterprise), hybrid cloud, and public cloud service providers]
• SaaS – applications, each running under its own SLA
• PaaS – platform tools & services with management (Python, Ruby, .Net, EJB, PHP, ...)
• IaaS – operating systems, virtualization, and resources (servers, storage, networks)
An application's SLA dictates the resources required to meet its specific requirements for availability, performance, cost, security, manageability, etc.
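How an SLA might drive resource selection can be sketched as a simple mapping (the thresholds, profile fields, and function name are invented for illustration, not any product's API):

```python
def resources_for_sla(sla):
    """Pick an illustrative resource profile from an application's SLA.

    All thresholds and profile values below are assumptions for the
    sketch: availability drives replica count, latency drives the
    storage tier, and data sensitivity drives deployment model.
    """
    profile = {"replicas": 1, "storage": "capacity-hdd", "deployment": "public"}
    if sla.get("availability", 0.0) >= 0.999:
        profile["replicas"] = 3               # survive node failure
    if sla.get("max_latency_ms", 1000) <= 10:
        profile["storage"] = "flash"          # low-latency tier
    if sla.get("data_sensitivity") == "high":
        profile["deployment"] = "private"     # keep data on premises
    return profile

print(resources_for_sla({"availability": 0.9999,
                         "max_latency_ms": 5,
                         "data_sensitivity": "high"}))
```

The point is the direction of the dependency: the SLA is the input, and the infrastructure layers (IaaS resources, storage tier, deployment model) are derived from it.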
37. Private Cloud Requirements for Big Data
[Diagram: operational flexibility vs. costs – moving from application silos to private cloud storage and on to public cloud storage]
• Service Catalog
− Choose pre-defined IT services per user/department
− Define SLAs to deliver services efficiently
• Self-Service
− Access resources on demand to speed deployment and delivery
− Scale resources up/down; release them when not needed
• Cloud Analytics
− Monitor & analyze usage for chargeback
− Interactively auto-tune resources to optimize their usage, performance and availability against SLA requirements
• Automation
− Automated provisioning – moving data & processes seamlessly
− Allocation and self-tuning of resources to meet workload requirements
38. Virtualization: Workloads Consolidation
• A single server 1.5x larger than a standard 2-way server will handle the consolidated load of 6 servers.
• Virtualization manages the workloads, so important apps automatically get the compute resources they need without operator intervention.
• Physical consolidation of 15-20:1 is easily possible.
• A reasonable goal for virtualized x86 servers is 40-50% utilization on large systems (>4-way), rising as dual/quad-core processors become available.
• Savings result in real estate, power & cooling, high availability, hardware, and management.
Source: Dan Olds & IMEX Research 2009
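The consolidation arithmetic behind these figures, using assumed per-server utilization (the ~10% figure for standalone servers is an illustration, not from the source):

```python
# Six 2-way servers idling at ~10% utilization present roughly
# 6 * 0.10 = 0.6 server-equivalents of real work.  A host 1.5x the
# size of one of them offers 1.5 server-equivalents of capacity,
# so the consolidated load runs it at about 0.6 / 1.5 = 40% --
# right in the 40-50% target range for virtualized x86 servers.
servers, util_each, host_size = 6, 0.10, 1.5
consolidated_util = servers * util_each / host_size
print(f"{consolidated_util:.0%}")  # 40%
```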
47. Best Practices – Storage in Big Data Apps
Goals & Implementation
• Establish goals for SLAs (performance/cost/availability), BC/DR (RPO/RTO) & compliance
• Increase performance for DB, OLTP and OLAP apps:
− Random I/O > 20x, sequential I/O bandwidth > 5x
− Remove stale data from production resources to improve performance
• Use partitioning software to classify data
− By frequency of access (recent usage) and
− Capacity (by percent of total data), using general guidelines such as: hyperactive (1%), active (5%), less active (20%), historical (74%)
Implementation
• Optimize tiering by classifying hot & cold data
− Improve query performance by reducing the number of I/Os
− Reduce the number of disks needed by 25-50% using advanced compression software achieving 2-4x compression
• Match data classification to tiered devices accordingly
− Flash, high-performance disk, low-cost capacity disk, online lowest-cost archival disk/tape
• Balance cost vs. performance of flash
− More data in flash > higher cache hit ratio > improved data performance
• Create and auto-manage tiering (monitoring, migrations, placements) without manual intervention
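The frequency-of-access guideline above (hyperactive 1%, active 5%, less active 20%, historical 74%) can be sketched as a simple classifier (illustrative only; the function and data are invented, not a tiering product's API):

```python
def classify_by_access(items):
    """Sort (key, access_count) items hottest-first and split them into
    tiers using the rough 1/5/20/74 percent capacity guideline."""
    tiers = [("hyperactive", 0.01), ("active", 0.05),
             ("less_active", 0.20), ("historical", 0.74)]
    ranked = sorted(items, key=lambda kv: kv[1], reverse=True)
    result, start = {}, 0
    for name, share in tiers:
        end = start + round(share * len(ranked))
        result[name] = [key for key, _ in ranked[start:end]]
        start = end
    # anything left over from rounding lands in the coldest tier
    result["historical"] += [key for key, _ in ranked[start:]]
    return result

# 100 hypothetical objects with decreasing access counts
items = [(f"obj{i}", 1000 - i) for i in range(100)]
tiers = classify_by_access(items)
print(len(tiers["hyperactive"]), len(tiers["active"]),
      len(tiers["less_active"]), len(tiers["historical"]))  # 1 5 20 74
```

Each resulting tier can then be matched to a device class: flash for the hyperactive slice, high-performance disk for active, capacity disk for less active, archival disk/tape for historical.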
49. Best Practices: Cached Storage
Application improvement of cached storage vs. HDD-only (LSI MegaRAID CacheCade Pro 2.0):

App/DB Server   Application Benchmark     Improvement over HDD-only
Oracle          OLTP benchmarks           681%
SQL Server      OLTP benchmark            1251%
Neoload         Web server simulation     533%
SysBench        MySQL OLTP server         150%

[Chart: response time and transactions/sec for the All-HDD, Smart Flash Cache, and Persist-Data-on-WarpDrive configurations]
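The gains above come from serving hot blocks out of flash. A first-order model of the "more data in flash > higher cache hit ratio > improved performance" relationship (the latency figures are assumptions for the sketch, not vendor numbers):

```python
def effective_latency_ms(hit_ratio, flash_ms=0.1, hdd_ms=8.0):
    """Average read service time when a fraction of reads hit the
    flash cache; assumed latencies: ~0.1 ms flash, ~8 ms HDD."""
    return hit_ratio * flash_ms + (1.0 - hit_ratio) * hdd_ms

for hit in (0.0, 0.5, 0.9):
    speedup = effective_latency_ms(0.0) / effective_latency_ms(hit)
    print(f"hit ratio {hit:.0%}: {speedup:.1f}x faster than all-HDD")
```

Because HDD latency dominates the average, even a modest hit ratio yields a large speedup, which is why putting more of the working set in flash pays off disproportionately.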
50. Big Data Targets: Analytics
Key Industries Benefitting from Big Data Analytics
[Charts: global IT spending by industry vertical, 2010-15 ($B), with CAGR % (2010-15) and 5-year cumulative global IT spending]
51. Big Data Targets – Storage Infrastructure
Value Potential of Using Big Data by Data-Intensive Verticals
Data intensity by industry vertical (installed terabytes per $M revenue, 2010; chart range 0.0-0.9):
• Banking & Financial Services
• Media & Entertainment
• Healthcare Providers
• Professional Services
• Telecommunications
• Pharma, Life Sciences & Medical Products
• Retail & Wholesale
• Utilities
• Industrial Electronics & Electrical Equipment
• Software Publishing & Internet Services
• Consumer Products
• Insurance
• Transportation
• Energy
52. Big Data Targets: Storage Infrastructure
Data Stored by Large US Enterprises – Big Data Storage Potential

Stored Data by Industry                Share of Stored    Stored Data TB/Firm
                                       Data (US 2009, PB) (>1K employees, US)
Discrete Manufacturing                 14%                231
Government                             12%                150
Communications and Media               10%                825
Process Manufacturing                  10%                1,507
Banking                                9%                 536
Health Care Providers                  6%                 801
Securities & Investment Services       6%                 870
Professional Services                  6%                 319
Retail                                 5%                 697
Education                              4%                 278
Insurance                              4%                 3,866
Transportation                         3%                 370
Wholesale                              3%                 1,931
Utilities                              3%                 831
Resource Industries                    2%                 1,792
Consumer and Recreational Services     2%                 1,312
Construction                           1%                 967
53. Big Data Targets: Savings with Open Source
[Chart: legacy BI vs. open-source Big Data analytics cost comparison]
54. Rise of Big Data Adoption