At the Technology Trends seminar held with lecturers from HCMC University of Polytechnics, KMS Technology's CTO gave a talk on Big Data, Cloud Computing, Mobile, Social Media and In-memory Computing.
4. WHAT IS BIG DATA?
“Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” - Teradata Magazine article, 2011
“Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” - The McKinsey Global Institute, 2011
Volume and variety of data that is difficult to manage using traditional data management technology
5. WHAT IS GENERATING BIG DATA?
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
6. HOW MUCH DATA?
• 7 billion people
• Google processes 100 PB/day; 3 million servers
• Facebook has 300 PB, plus 500 TB/day of new data; 35% of the world’s photos
• YouTube has 1000 PB of video storage; 4 billion views/day
• Twitter processes 124 billion tweets/year
• SMS messages – 6.1T per year
• US cell calls – 2.2T minutes per year
• US credit cards – 1.4B cards; 20B transactions/year
7. LOWER COST OF STORAGE
What can I buy for $100 (USD)? (not adjusted for inflation)
• Memory capacity: 128 GB by 2020 (x1420 in 20 years)
• Disk capacity: 10 TB by 2020 (x1000 in 20 years)
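The 20-year multipliers on this slide can be converted into an implied yearly improvement. The following is a minimal sketch that assumes smooth compound growth, which the slide does not claim; the function name is made up for illustration.

```python
# Implied compound annual growth from a total multiplier over a period.
# Assumption: growth is spread evenly (compounded) over the years.

def annual_growth_factor(total_multiplier: float, years: int) -> float:
    """Return g such that g ** years == total_multiplier."""
    return total_multiplier ** (1.0 / years)

memory_factor = annual_growth_factor(1420, 20)  # memory: x1420 in 20 years
disk_factor = annual_growth_factor(1000, 20)    # disk: x1000 in 20 years

print(f"memory: about {100 * (memory_factor - 1):.0f}% more capacity per $100 each year")
print(f"disk:   about {100 * (disk_factor - 1):.0f}% more capacity per $100 each year")
```

Under that assumption, both figures correspond to a bit over 40% more capacity per dollar each year.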
8. HOW IS BIG DATA DIFFERENT?
• Automatically generated by a machine
– (e.g. Sensor embedded in an engine)
• Typically an entirely new source of data
– (e.g. Use of the internet)
• Not designed to be friendly
– (e.g. Text streams)
• May not have much value
– Need to focus on the important part
9. WHO UTILIZES IT?
• Companies and organizations that can leverage large-scale consumer-produced data
– Marketing
– Consumer Markets (retail, airlines, hotels, Amazon, Netflix)
– Social Media (Facebook, Twitter, YouTube, LinkedIn)
– Search Providers (Google, Yahoo, Microsoft)
– People Data Aggregators (LexisNexis, Equifax, Acxiom)
• Other enterprises are slowly getting into it
– Healthcare
– Financial Institutions
11. TYPE OF DATA
• Structured Data (Transactions)
• Text Data (Web Content)
• Semi-structured Data (XML)
• Unstructured Data
– Social Network, SMS, Audio, Video
• Streaming Data
– You can only scan the data once as it travels over the network
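The "scan once" constraint on streaming data means statistics have to be updated incrementally instead of by re-reading stored values. A minimal sketch, where the sensor generator and its readings are hypothetical:

```python
from typing import Iterable, Iterator, Tuple

def running_stats(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Single pass: return (count, mean, maximum) without storing the stream."""
    count = 0
    mean = 0.0
    maximum = float("-inf")
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        maximum = max(maximum, value)
    return count, mean, maximum

def sensor_readings() -> Iterator[float]:
    """Hypothetical stream; a generator can be consumed only once."""
    for reading in (21.5, 22.0, 23.5, 21.0):
        yield reading

count, mean, maximum = running_stats(sensor_readings())
print(count, mean, maximum)
```

Memory use stays constant no matter how long the stream runs, which is the point of a one-pass design.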
12. WHAT TO DO WITH THESE DATA?
• Aggregation and Statistics
– Data warehouse and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data Mining
– Statistical Modeling
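As an illustration of the "Indexing, Searching, and Querying" item, keyword search is commonly built on an inverted index that maps each word to the documents containing it. The sketch below is a toy version; the sample documents and function names are assumptions made for illustration, and real engines add tokenization, ranking, and on-disk storage.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every lowercase word to the set of document ids containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical sample documents.
docs = {
    1: "big data needs new tools",
    2: "data warehouse and OLAP",
    3: "keyword search over big text",
}
index = build_index(docs)
hits = search(index, "big data")
```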
13. RDBMS LIMITATIONS
• Very difficult to scale horizontally (more boxes); the
usual way to scale is vertically, by using a bigger box
– Physically limited by CPUs, disk storage, and memory
– Large servers are too expensive and still can’t scale
• Requires structure of tables with rows and columns
– Does not deal well with unstructured data
• Relationships have to be pre-defined through schema
– Difficult to add newly discovered data quickly
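The schema limitation can be seen with Python's built-in sqlite3 module. This is a small sketch with a hypothetical customers table: a newly discovered attribute is rejected until the schema is altered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ann')")

# A new attribute (hypothetical: twitter_handle) is not in the schema yet,
# so the insert fails.
try:
    conn.execute(
        "INSERT INTO customers (id, name, twitter_handle) VALUES (2, 'Bob', '@bob')"
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# The schema has to change before the new data can be stored.
conn.execute("ALTER TABLE customers ADD COLUMN twitter_handle TEXT")
conn.execute(
    "INSERT INTO customers (id, name, twitter_handle) VALUES (2, 'Bob', '@bob')"
)
rows = conn.execute(
    "SELECT id, name, twitter_handle FROM customers ORDER BY id"
).fetchall()
```

On a small in-memory table the change is instant; on a large production database a schema change may need planning and coordination with every application that uses the table.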
15. NOSQL CHARACTERISTICS
• Cheap, easy to implement (open source)
– Cluster of cheap commodity servers with cheap storage
• Data are replicated to multiple nodes (therefore
identical and fault-tolerant) and can be partitioned
– Down nodes can easily be replaced while cluster is operational
– No single point of failure
• Easy to distribute
• Don't require a schema
• Massive Scalability
• Relaxes the data consistency requirement (CAP), which
means less locking and resource contention
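The replication and fault-tolerance points can be sketched as follows. This is a toy in-memory model, not the behavior of any particular NoSQL product; the class, node names, and placement rule (hash the key, then take the next nodes in order) are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Optional, Set


class ReplicatedStore:
    """Toy cluster: each key is written to `replicas` consecutive nodes."""

    def __init__(self, node_names: List[str], replicas: int = 2) -> None:
        self.node_names = list(node_names)
        self.replicas = replicas
        self.data: Dict[str, Dict[str, str]] = {name: {} for name in node_names}
        self.down: Set[str] = set()  # nodes currently unavailable

    def owners(self, key: str) -> List[str]:
        """Pick the nodes responsible for a key from a hash of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.node_names)
        return [
            self.node_names[(start + i) % len(self.node_names)]
            for i in range(self.replicas)
        ]

    def put(self, key: str, value: str) -> None:
        for node in self.owners(key):
            if node not in self.down:
                self.data[node][key] = value

    def get(self, key: str) -> Optional[str]:
        """Read from the first replica that is up and holds the key."""
        for node in self.owners(key):
            if node not in self.down and key in self.data[node]:
                return self.data[node][key]
        return None


store = ReplicatedStore(["n1", "n2", "n3"], replicas=2)
store.put("user:42", "Ann")
primary = store.owners("user:42")[0]
store.down.add(primary)  # simulate a failed node
value_after_failure = store.get("user:42")
```

With two replicas the read still succeeds after one node fails. The sketch ignores what happens to writes made while a node is down, which is where the relaxed consistency mentioned above comes in.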
16. NOSQL – SEVERAL OPTIONS
• Currently 150 implementations and growing
(http://nosql-database.org/)
• Multiple Types based on storage architecture
– Key-Value
– Document
– Column Family
– Graph
17. KEY-VALUE STORE
• Values stored as Key-Value pairs in a hash map
• Distributed across nodes based on key
• Simple Operations: insert, fetch, update, and delete
• Best for storing high-volume datasets with low
complexity (simple data model)
• Some of the market leaders:
– Riak
– Amazon Dynamo
– Voldemort
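The four simple operations and the key-based distribution can be sketched in a few lines. This is a toy model in which each "node" is a plain dictionary; the class and method names are illustrative assumptions and do not match the API of Riak, Dynamo, or Voldemort.

```python
import zlib
from typing import Any, Dict, List, Optional


class KeyValueStore:
    """Toy key-value store partitioned across in-memory 'nodes' by key hash."""

    def __init__(self, node_count: int = 3) -> None:
        self.nodes: List[Dict[str, Any]] = [{} for _ in range(node_count)]

    def _node_for(self, key: str) -> Dict[str, Any]:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        return self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]

    def insert(self, key: str, value: Any) -> None:
        node = self._node_for(key)
        if key in node:
            raise KeyError(f"key already exists: {key}")
        node[key] = value

    def fetch(self, key: str) -> Optional[Any]:
        return self._node_for(key).get(key)

    def update(self, key: str, value: Any) -> None:
        node = self._node_for(key)
        if key not in node:
            raise KeyError(f"no such key: {key}")
        node[key] = value

    def delete(self, key: str) -> None:
        self._node_for(key).pop(key, None)


kv = KeyValueStore(node_count=3)
kv.insert("session:1", {"user": "ann"})
kv.update("session:1", {"user": "ann", "cart": 2})
current = kv.fetch("session:1")
kv.delete("session:1")
```

Every operation touches exactly one node, chosen from the key alone, which is why this model scales out easily but offers no queries on the values themselves.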
19. COLUMN FAMILY STORE
• Stores families of columns
• Columns are stored as Key-Value pairs
• A super column is like a catalogue or a collection of other
columns
• Columns within a family can be distributed across nodes
• Supports semi-structured data with high scalability
• Some of the market leaders:
– HBase
– Cassandra
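One way to picture the column family model is as nested maps: row key, then column family, then column name, then value. The sketch below is a simplified in-memory picture under that assumption; it is not the HBase or Cassandra API, and the class and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


class ColumnFamilyTable:
    """Toy table: row key -> column family -> column name -> value."""

    def __init__(self, families: Iterable[str]) -> None:
        self.families = set(families)  # families are fixed up front
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key: str, family: str, column: str, value: str) -> None:
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns inside a family are free-form key-value pairs.
        self.rows[row_key][family][column] = value

    def get_family(self, row_key: str, family: str) -> Dict[str, str]:
        return dict(self.rows.get(row_key, {}).get(family, {}))


table = ColumnFamilyTable(families=["profile", "activity"])
table.put("u1", "profile", "name", "Ann")
table.put("u1", "profile", "city", "HCMC")
table.put("u2", "profile", "name", "Bob")  # u2 has no 'city' column
u1_profile = table.get_family("u1", "profile")
u2_profile = table.get_family("u2", "profile")
```

Rows in the same family do not need the same columns, which is what makes the model a fit for semi-structured data.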
21. DOCUMENT STORE
• Supports more complex data model than Key-Value
• Collection of Documents – JSON, XML, other semi-
structured formats
• A document is a key-value collection
• Multi-Index support
• Best for storing a complex data model, but less scalable
• Some of the market leaders:
– MongoDB
– CouchDB
– SimpleDB
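The document model and multi-index support can be sketched as a collection of dictionaries with optional secondary indexes. This is a toy illustration only; the class and field names are hypothetical, it does not handle updates to existing documents, and it is not the MongoDB, CouchDB, or SimpleDB API.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


class Collection:
    """Toy document collection with secondary indexes on chosen fields."""

    def __init__(self, indexed_fields: Iterable[str]) -> None:
        self.docs: Dict[str, Dict[str, Any]] = {}
        self.indexes = {field: defaultdict(set) for field in indexed_fields}

    def insert(self, doc_id: str, doc: Dict[str, Any]) -> None:
        self.docs[doc_id] = doc
        for field, index in self.indexes.items():
            if field in doc:  # documents may omit any field
                index[doc[field]].add(doc_id)

    def find(self, field: str, value: Any) -> List[Dict[str, Any]]:
        if field in self.indexes:
            ids = self.indexes[field].get(value, set())  # index lookup
        else:
            ids = {i for i, d in self.docs.items() if d.get(field) == value}  # full scan
        return [self.docs[i] for i in sorted(ids)]


people = Collection(indexed_fields=["city", "role"])
people.insert("p1", {"name": "Ann", "city": "HCMC", "role": "analyst"})
people.insert("p2", {"name": "Bob", "city": "HCMC", "tags": ["r", "excel"]})
people.insert("p3", {"name": "Cara", "city": "Hanoi", "role": "analyst"})
in_hcmc = [doc["name"] for doc in people.find("city", "HCMC")]
analysts = [doc["name"] for doc in people.find("role", "analyst")]
```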
23. GRAPH DATABASE
• Social Graph with Relationship between Entities
• Great for Social Networks
– Facebook friends network
– LinkedIn connections network
• Some of the market leaders:
– Neo4j
– FlockDB
– Pregel
24. GRAPH DATABASE - EXAMPLE
• Nodes represent entities such as people, businesses,
accounts, or any other item you might want to keep track of.
• Properties are pertinent information that relates to
nodes, such as name, age, DOB, gender.
• Edges are the lines that connect nodes to nodes or nodes
to properties, and they represent the relationship between
the two.
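The node, property, and edge vocabulary above can be sketched as a small in-memory graph. This is a toy model with hypothetical names and sample people; edges are treated as undirected for simplicity, and it is not the API of Neo4j or any other graph database.

```python
from collections import defaultdict
from typing import Any, Dict, Set, Tuple


class Graph:
    """Toy property graph: nodes carry properties, edges carry a relationship."""

    def __init__(self) -> None:
        self.properties: Dict[str, Dict[str, Any]] = {}
        self.edges: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def add_node(self, node_id: str, **properties: Any) -> None:
        self.properties[node_id] = properties

    def add_edge(self, a: str, relationship: str, b: str) -> None:
        # Assumption: relationships are symmetric, so store both directions.
        self.edges[a].add((relationship, b))
        self.edges[b].add((relationship, a))

    def neighbors(self, node_id: str, relationship: str) -> Set[str]:
        return {
            other
            for rel, other in self.edges.get(node_id, set())
            if rel == relationship
        }

    def friends_of_friends(self, node_id: str) -> Set[str]:
        direct = self.neighbors(node_id, "FRIEND")
        second: Set[str] = set()
        for friend in direct:
            second |= self.neighbors(friend, "FRIEND")
        return second - direct - {node_id}


g = Graph()
g.add_node("ann", name="Ann", age=31)
g.add_node("bob", name="Bob", age=28)
g.add_node("cara", name="Cara", age=35)
g.add_node("dan", name="Dan", age=40)
g.add_edge("ann", "FRIEND", "bob")
g.add_edge("bob", "FRIEND", "cara")
g.add_edge("cara", "FRIEND", "dan")
suggestions = g.friends_of_friends("ann")
```

A "people you may know" query like this follows edges directly, where a relational design would need repeated self-joins on a friendship table.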
26. NEWSQL
• The argument is that the relational model itself is not the cause of
poor scalability; the limitations of its physical implementation are
• Development of new relational database products and services
designed to bring the benefits of the relational model to distributed
architectures
• Three Approaches:
– Optimized MySQL storage engines (ScaleDB, MemSQL, Akiban)
– New SQL databases (Clustrix, VoltDB, NuoDB)
– Sharding Middleware to split RDBMS across nodes
(ScaleBase, Scalearc, dbShards)
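The third approach, sharding middleware, can be sketched with Python's built-in sqlite3 module standing in for the database nodes. This is a toy router under stated assumptions: the orders table, the modulo rule on a hypothetical customer_id shard key, and the class name are all illustrative, and the products named above work differently in detail.

```python
import sqlite3
from typing import List


class ShardedOrders:
    """Toy sharding layer: routes each customer to one of several SQL databases."""

    def __init__(self, shard_count: int = 2) -> None:
        self.shards: List[sqlite3.Connection] = [
            sqlite3.connect(":memory:") for _ in range(shard_count)
        ]
        for db in self.shards:
            db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")

    def _shard_for(self, customer_id: int) -> sqlite3.Connection:
        # Assumption: a simple modulo on the shard key picks the node.
        return self.shards[customer_id % len(self.shards)]

    def add_order(self, customer_id: int, amount: float) -> None:
        self._shard_for(customer_id).execute(
            "INSERT INTO orders VALUES (?, ?)", (customer_id, amount)
        )

    def total_for(self, customer_id: int) -> float:
        """Single-shard query: only one node is touched."""
        row = self._shard_for(customer_id).execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
        return row[0]

    def grand_total(self) -> float:
        """Cross-shard query: ask every node, then combine the results."""
        return sum(
            db.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]
            for db in self.shards
        )


orders = ShardedOrders(shard_count=2)
orders.add_order(1, 10.0)
orders.add_order(2, 5.0)
orders.add_order(1, 2.5)
customer_total = orders.total_for(1)
overall_total = orders.grand_total()
```

Applications keep writing ordinary SQL against each shard, while queries that span customers have to be fanned out and merged by the middleware.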
28. SOURCE AND APPROACH
• Independent testing done by Altoros Systems Inc.
• More details at
http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html?page=1
• Using Amazon virtual machines to ensure verifiable results and
research transparency (which also helped minimize errors due to
hardware differences)
• Databases compared:
– Riak, a key-value store
– Cassandra, a column family store
– HBase, a column family store
– MongoDB, a document-oriented database
– MySQL Cluster, a NewSQL database
– Sharded MySQL, a NewSQL approach
32. EXAMPLE: HEALTHCARE
A health care consultancy has made the data coming out of medical practices
the focus of its thriving business. The company collects billing and diagnostic
code data from 10,000 doctors on a daily, weekly and monthly basis to create
a virtual clinical integration model. The consulting company analyzes the data
to help the groups understand how well they are meeting the FTC guidelines
for negotiating with health plans and whether they qualify for enhanced
reimbursement based on offering a more cost-effective standard of care.
It also sends the practices automated information to help them take better
care of patients, for example through an automated outbound calling system
for pediatric patients who weren’t up to date on their vaccinations.
33. EXAMPLE: RETAIL
Walmart handles more than 1 million customer transactions every
hour, which are imported into databases estimated to contain more than 2.5
petabytes of data, the equivalent of 167 times the information
contained in all the books in the US Library of Congress.
34. EXAMPLE: UTILITY
With a smart meter, a utility company goes from collecting one data point
a month per customer (using a meter reader in a truck or car) to receiving
3,000 data points for each customer each month, since smart meters
send usage information up to four times an hour.
One small Midwestern utility is using smart meter data to structure
conservation programs that analyze existing usage to forecast future
use, price usage based on demand and share that information with
customers who might decide to forestall doing that load of wash until
they can pay for it at the nonpeak price.
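The figures in this example can be checked with a short calculation: four readings an hour over a 30-day month gives 2,880 data points, close to the 3,000 quoted. The sketch below also shows demand-based pricing in its simplest form; the peak hours and the per-kWh rates are hypothetical values chosen only for illustration.

```python
READINGS_PER_HOUR = 4  # "up to four times an hour"


def readings_per_month(days: int = 30) -> int:
    """Number of smart meter data points per customer in one month."""
    return READINGS_PER_HOUR * 24 * days


# Hypothetical tariff: these hours and rates are not from the source.
PEAK_HOURS = range(17, 21)  # 17:00 to 20:59
PEAK_RATE = 0.30            # USD per kWh
OFF_PEAK_RATE = 0.10        # USD per kWh


def load_cost(kwh: float, hour: int) -> float:
    """Cost of running one appliance load that starts at the given hour."""
    rate = PEAK_RATE if hour in PEAK_HOURS else OFF_PEAK_RATE
    return kwh * rate


monthly_points = readings_per_month()
wash_at_peak = load_cost(2.0, hour=18)
wash_off_peak = load_cost(2.0, hour=22)
```

With these assumed rates, delaying a 2 kWh load of wash from 18:00 to 22:00 cuts its cost to a third, which is the kind of signal the utility shares with customers.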