Big Data
Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem."[2] Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on."[3] Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas including Internet search, fintech, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics,[4] connectomics, complex physics simulations, biology and environmental research.[5]
Data sets grow rapidly, in part because they are increasingly gathered by cheap and numerous information-sensing Internet of things devices such as mobile devices, aerial sensors (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks.[6][7] The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[8] as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data are generated.[9] One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.[10]
Relational database management systems and desktop statistics- and visualization-packages often have difficulty handling big data. The work may require "massively parallel software running on tens, hundreds, or even thousands of servers".[11] What counts as "big data" varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."
Big Data Analysis Patterns - TriHUG 6/27/2013boorad
Big Data Analysis Patterns: Tying real world use cases to strategies for analysis using big data technologies and tools.
Big data is ushering in a new era for analytics with large scale data and relatively simple algorithms driving results rather than relying on complex models that use sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think.
This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, & Titan.
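At their core, item-based recommenders of the kind Mahout popularized are driven by co-occurrence counting: items that frequently appear together are recommended together. As a rough illustration only (plain Python, not the Mahout or Titan APIs; the basket data is made up):

```python
from collections import defaultdict
from itertools import combinations

# Toy co-occurrence recommender: count how often each pair of items
# appears together in the same "basket" (session, order, playlist, ...).
baskets = [
    {"hadoop", "mahout", "solr"},
    {"hadoop", "storm"},
    {"mahout", "solr"},
]

cooc = defaultdict(int)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def recommend(item):
    """Rank other items by how often they co-occur with `item`."""
    scores = {b: n for (a, b), n in cooc.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("mahout"))  # ['solr', 'hadoop'] - solr co-occurs twice
```

At scale the counting step is exactly the kind of embarrassingly parallel work Hadoop handles, which is why this pattern maps so naturally onto big data tooling.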
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Simplilearn
This presentation about Big Data will help you understand how Big Data evolved over the years, what Big Data is, applications of Big Data, a case study on Big Data, 3 important challenges of Big Data, and how Hadoop solved those challenges. The case study talks about the Google File System (GFS), where you'll learn how Google solved its problem of storing increasing user data in the early 2000s. We'll also look at the history of Hadoop and its ecosystem, with a brief introduction to HDFS, a distributed file system designed to store large volumes of data, and MapReduce, which allows parallel processing of data. In the end, we'll run through some basic HDFS commands and see how to perform word count using MapReduce. Now, let us get started and understand Big Data in detail.
Below topics are explained in this Big Data presentation for beginners:
1. Evolution of Big Data
2. Why Big Data?
3. What is Big Data?
4. Challenges of Big Data
5. Hadoop as a solution
6. MapReduce algorithm
7. Demo on HDFS and MapReduce
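The word-count demo in topic 7 boils down to two phases: a map phase that emits a `(word, 1)` pair per word, and a reduce phase that sums the pairs per key. A minimal sketch of the idea in plain Python (not the Hadoop API; `mapper` and `reducer` are illustrative names):

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(pairs):
    """Reduce phase: sum the counts for each word key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data is big", "hadoop processes big data"]
pairs = [pair for line in lines for pair in mapper(line)]
print(reducer(pairs))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

In real Hadoop the mappers run in parallel across HDFS blocks and a shuffle step groups pairs by key before the reducers run; the toy above collapses all of that into two function calls.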
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
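The functional style objectives 10-12 describe can be previewed without a cluster. The comments below name Spark's RDD operations, but the code itself is ordinary Python over an in-memory list, not the PySpark API:

```python
from functools import reduce

# An RDD pipeline like rdd.flatMap(...).map(...).reduceByKey(...) mimicked
# with plain Python on an in-memory list of lines.
data = ["spark makes big data simple", "big data needs spark"]

flat_mapped = [w for line in data for w in line.split()]   # flatMap: line -> words
mapped = [(w, 1) for w in flat_mapped]                     # map: word -> (word, 1)
reduced = reduce(                                          # reduceByKey: sum per word
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    mapped,
    {},
)
print(reduced["big"], reduced["spark"])  # 2 2
```

The real RDD versions of these operations are lazy and partitioned across executors, which is what makes the same three-step pipeline scale to terabytes.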
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
In this ppt I've shown the difference between data and Big Data: how Big Data is generated, opportunities with Big Data, problems that occur with Big Data and their solutions, Big Data tools, what Data Science is and how it relates to Big Data, and Data Scientist vs Data Analyst. At last, one real-life scenario where Big Data, data scientists, and data analysts work together.
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...BigMine
Talk by Usama Fayyad at BigMine12 at KDD12.
Virtually all organizations are having to deal with Big Data in many contexts: marketing, operations, monitoring, performance, and even financial management. Big Data is characterized not just by its size, but by its Velocity and its Variety, for which keeping up with the data flux, let alone its analysis, is challenging at best and impossible in many cases. In this talk I will cover some of the basics in terms of infrastructure and design considerations for effective and efficient Big Data. In many organizations, the lack of consideration of effective infrastructure and data management leads to unnecessarily expensive systems for which the benefits are insufficient to justify the costs. We will refer to example frameworks and clarify the kinds of operations where MapReduce (Hadoop and its derivatives) is appropriate and the situations where other infrastructure is needed to perform segmentation, prediction, analysis, and reporting appropriately, these being the fundamental operations in predictive analytics. We will then pay specific attention to on-line data and the unique challenges and opportunities represented there. We cover examples of Predictive Analytics over Big Data with case studies in eCommerce marketing, on-line publishing and recommendation systems, and advertising targeting. Special focus will be placed on the analysis of on-line data with applications in Search, Search Marketing, and targeting of advertising. We conclude with some technical challenges as well as the solutions that can be applied to these challenges in social network data.
This presentation introduces concepts of Big Data in a layman's language. Author does not claim the originality of the content. The presentation is made by compiling from various sources. Author does not claim copyrights or privacy issues.
Big data is exponentially rising in today's age of information and digital shrinkage. This presentation potentially clears the concept and revolving hype around it.
Big Data Analysis Patterns with Hadoop, Mahout and Solrboorad
Big Data Analysis Patterns: Tying real world use cases to strategies for analysis using big data technologies and tools.
Big data is ushering in a new era for analytics with large scale data and relatively simple algorithms driving results rather than relying on complex models that use sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think.
This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, & Titan.
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...Edureka!
** Hadoop Training: https://www.edureka.co/hadoop **
This Edureka tutorial on "Data Science vs Big Data vs Data Analytics" will explain the similarities and differences between them. Also, you will get a complete insight into the skills required to become a Data Scientist, Big Data Professional, and Data Analyst.
Below topics are covered in this tutorial:
1. What is Data Science, Big Data, Data Analytics?
2. Roles and Responsibilities of Data Scientist, Big Data Professional and Data Analyst
3. Required Skill set.
4. Understanding how data science, big data, and data analytics are used to drive the success of Netflix.
Check our complete Hadoop playlist here: https://goo.gl/hzUO0m
Learn Big Data and Hadoop online at Easylearning Guru. We offer instructor-led online training and lifetime LMS (Learning Management System) access. Join our free live demo classes of Big Data Hadoop.
Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include different types such as structured/unstructured and streaming/batch, and different sizes from terabytes to zettabytes. Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. It has one or more of the following characteristics: high volume, high velocity, or high variety. Big data comes from sensors, devices, video/audio, networks, log files, transactional applications, the web, and social media - much of it generated in real time and at very large scale.
Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, businesses can analyze previously untapped data sources independently or together with their existing enterprise data to gain new insights, resulting in significantly better and faster decisions.
A high-level overview of common Cassandra use cases, adoption reasons, Big Data trends, DataStax Enterprise, and the future of Big Data, given at the 7th Advanced Computing Conference in Seoul, South Korea
BDaaS (Big Data as a Service) by Sherya Pal from Saama. The presentation was done at #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved with the author.
This workshop is for a "Big Data using Hadoop course" at IMC Institute in March 2015. The workshop is based on Apache Hadoop and using an EC2 server on AWS.
Cloud Computing Management for SME Business Organizations - IMC Institute
Seminar presentation on Cloud Computing
New generation of SME management for the ASEAN Economic Community (AEC) using Cloud Computing Technology, Saturday 28 February 2015, 12:30 - 15:45
at The Emerald Hotel, Bangkok
This presentation gives an insight into what big data and data analytics are, the difference between big data and data science, and salary trends in big data analytics.
An introduction to Big Data, the problems associated with storing and analyzing it, and how Hadoop solves those problems with its HDFS and MapReduce frameworks. A little intro to HDInsight, Hadoop on Windows Azure.
Big Data brings big promise and also big challenges, the primary and most important one being the ability to deliver Value to business stakeholders who are not data scientists!
I have collected information for beginners to provide an overview of Big Data and Hadoop, which will help them understand the basics and give them a head start.
A brief intro to the idea of what Big Data is and its potential. This is primarily a basic study and I have quoted the sources of infographics, stats and text at the end. If I have missed any reference due to human error and you recognize another source, please mention it.
Big Data with Hadoop and HDInsight. This is an intro to the technology. If you are new to Big Data or have just heard of it, this presentation helps you learn just a little bit more about the technology.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux tools -- libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
1. Introduction to Big Data
Dr. Putchong Uthayopas
Department of Computer Engineering,
Faculty of Engineering, Kasetsart University
Email: putchong@ku.th
2. We are living in the world of Data
• Geophysical exploration
• Medical imaging
• Video surveillance
• Mobile sensors
• Gene sequencing
• Smart grids
• Social media
7. "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." — Gartner Inc.
10. Why Big Data?
• Improve products and services
• Increase customer satisfaction / understand customer behavior
• Improve operational efficiency
• Understand emerging market trends
The real value of big data is in the insights it produces when analyzed: discovered patterns, derived meaning, indicators for decisions, and ultimately the ability to respond to the world with greater intelligence.
http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/big-data-cloud-technologies-brief.pdf
"Know thy self, know thy enemy. A thousand battles, a thousand victories."
12. Big Data vs. Business Intelligence vs. Analytics
• BI software and technology
– Well-structured data from a warehouse
– Visual representation of data to gain insight into the data
– Some predictive capability, such as statistical analysis and data mining
• Big Data
– Focus on analysis of huge, unstructured data sets to gain insight automatically
14. Volume
• Big data must be huge
– Beyond the capability of a single server to process it
– Possible to store the data but difficult to process it
15. Velocity
• Big data accumulates at a very fast speed
– Stock market data
– Internet access logs
– Social media data (Twitter, Facebook, IG)
• We need to extract as much meaning as we can, as fast as we can, before throwing the data away
16. Variety
• Data come in many forms
– Traditional databases
– Documents
– Web pages
– Social media data
– Images
– Video/audio
– Location
17. Diya Soubra, The 3Vs that define Big Data, 2012
http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data
18. Considerations for Applying Big Data
http://fredericgonzalo.com/en/2013/07/07/big-data-in-tourism-hospitality-4-key-components/
20. Big Data Ecosystem
Reference: http://dataconomy.com/understanding-big-data-ecosystem/
21. Big Data Ecosystem: Infrastructure
• Hadoop
– Technologies designed for storing, processing and analysing data by breaking data up, distributing the parts, and analysing those parts concurrently, rather than tackling one monolithic block of data all in one go.
• NoSQL
– Stands for "Not Only SQL"; used for processing large volumes of multi-structured data. Most NoSQL databases are most adept at handling discrete data stored among multi-structured data.
• Massively Parallel Processing (MPP) databases
– MPP databases work by segmenting data across multiple nodes and processing these segments of data in parallel, using SQL.
Reference: http://dataconomy.com/understanding-big-data-ecosystem/
22. Big Data Ecosystem: Analytics
• Analytics platforms
– Integrate and analyse data to uncover new insights and help companies make better-informed decisions.
• Visualization platforms
– Take raw data and present it in complex, multi-dimensional visual formats to illuminate the information.
• Business Intelligence (BI) platforms
– Analyze data from multiple sources to deliver services such as business intelligence reports, dashboards and visualizations.
• Machine learning
– In machine learning, data is what the algorithm 'learns from', and the output depends on the use case. One of the most famous examples is IBM's supercomputer Watson, which has 'learned' to scan vast amounts of information to find specific answers, and can comb through 200 million pages of structured and unstructured data in minutes.
Reference: http://dataconomy.com/understanding-big-data-ecosystem/
23. How can we store and process massive data?
• Beyond the capability of a single server
• Basic infrastructure
– Cluster of servers
– High-speed interconnect
– High-speed storage cluster
• Incoming data are spread across the server farm
• Processing is quickly distributed to the farm
• Results are collected and sent back
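The scatter/process/gather flow above can be sketched in a few lines of plain Python. This is an illustrative single-machine stand-in, assuming a simple hash partitioner to play the role of the cluster scheduler; it is not any particular cluster framework.

```python
def partition(records, n_servers):
    """Spread incoming records across the server farm by hashing each record."""
    shards = [[] for _ in range(n_servers)]
    for rec in records:
        shards[hash(rec) % n_servers].append(rec)
    return shards

def process_shard(shard):
    """The work each server does on its own shard (here: count error entries)."""
    return sum(1 for rec in shard if "error" in rec)

records = ["ok", "error: disk", "ok", "error: net"] * 250   # 1000 log lines
shards = partition(records, n_servers=4)          # scatter across 4 "servers"
partials = [process_shard(s) for s in shards]     # each shard processed independently
total = sum(partials)                             # gather partial results and combine
print(total)                                      # 500 error lines in total
```

In a real cluster the list comprehension over `shards` would run concurrently, one shard per machine, which is exactly why the work must be expressible as independent per-shard steps plus a final combine.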
24. NoSQL (Not Only SQL)
• A NoSQL (often interpreted as "Not only SQL") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
– Non-relational, distributed, open-source and horizontally scalable
– Used to handle huge amounts of data
– The original intention was modern web-scale databases
Reference: http://nosql-database.org/
25. • MongoDB is a general-purpose, open-source database.
• MongoDB features:
– Document data model with dynamic schemas
– Full, flexible index support and rich queries
– Auto-sharding for horizontal scalability
– Built-in replication for high availability
– Text search
– Advanced security
26. • Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
• The base Apache Hadoop framework is composed of the following modules:
– Hadoop Common: contains libraries and utilities needed by other Hadoop modules
– Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
– Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications
– Hadoop MapReduce: a programming model for large-scale data processing
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
27. Magic behind Hadoop and HDFS
• The problem is divided into two phases
– Map: apply some action to the data as <key, value> pairs and produce intermediate results
– Reduce: summarize the intermediate <key, value> results and return them to the main program
Ricky Ho, How Hadoop Map/Reduce works, http://architects.dzone.com/articles/how-hadoop-mapreduce-works
28. Example: word count
• Counting words in an input text file.
– How many times does the word "love" appear in a novel? ^_^
• In the map phase the sentence is split into words, forming the initial key-value pairs <word, 1>
– "tring tring the phone rings" becomes <tring,1>, <tring,1>, <the,1>, <phone,1>, <rings,1>
• In the reduce phase the keys are grouped together and the values for similar keys are added.
– The only repeated key is 'tring', so its values are added, and the output key-value pairs become <tring,2>, <the,1>, <phone,1>, <rings,1>
• Reduce forms an aggregation phase over the keys
– This gives the number of occurrences of each word in the input.
http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
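The word-count flow on this slide can be mimicked in plain Python. This is a single-process sketch of the map and reduce steps, not Hadoop itself: Hadoop would run many mappers and reducers in parallel and handle the grouping (the "shuffle") between them.

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit an initial <word, 1> pair for every word in the input
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    # Group pairs by key and add the values for similar keys
    counts = defaultdict(int)
    for word, value in pairs:
        counts[word] += value
    return dict(counts)

pairs = map_phase("tring tring the phone rings")
# pairs == [('tring', 1), ('tring', 1), ('the', 1), ('phone', 1), ('rings', 1)]
result = reduce_phase(pairs)
print(result)   # {'tring': 2, 'the': 1, 'phone': 1, 'rings': 1}
```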
29. In-memory database
• An in-memory database is
– a database management system that primarily relies on main memory for computer data storage.
– faster than disk-optimized databases, since the internal optimization algorithms are simpler and execute fewer CPU instructions.
– Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.
Source: http://en.wikipedia.org/wiki/In-memory_database
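Python's standard library offers a convenient way to try the idea: SQLite can run entirely in main memory via the special `:memory:` connection string. The table and rows below are invented for illustration.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM: no disk file, no seek time
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 3), ("bob", 5), ("alice", 2)])

total = conn.execute(
    "SELECT SUM(clicks) FROM events WHERE user = ?", ("alice",)
).fetchone()[0]
print(total)   # 5
conn.close()
```

The trade-off, of course, is durability: an in-memory database vanishes when the process exits, which is why production in-memory systems add snapshotting or replication.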
31. What is Spark?
Fast and expressive cluster computing engine compatible with Apache Hadoop
Efficient
• General execution graphs
• In-memory storage
Usable
• Rich APIs in Java, Scala, Python
• Interactive shell
2-5× less code; up to 10× faster on disk, 100× in memory
33. Spark at Yahoo
• Personalizing news pages for Web visitors, and running analytics for advertising. For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and also to categorize news stories as they arise, to figure out what types of users would be interested in reading them.
– Yahoo wrote a Spark ML algorithm in 120 lines of Scala. (Previously, its ML algorithm for news personalization was written in 15,000 lines of C++.)
– With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.
• The second use case shows off the interactive capability of Hive on Spark (Shark).
– Use existing BI tools to view and query advertising analytics data collected in Hadoop.
http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
34. Big Data Goes to the Cloud
• Data is already on the cloud
– Virtual organizations
– Cloud-based SaaS services
• Big Data as a Service on the cloud
– Private cloud
– Public cloud
35. Amazon
• Amazon EC2
– Computation service using VMs
• Amazon DynamoDB
– Large, scalable NoSQL database
– Fully distributed, shared-nothing architecture
• Amazon Elastic MapReduce (Amazon EMR)
– Hadoop-based analysis engine
– Can be used to analyse big data without the need to build the infrastructure
http://aws.amazon.com/big-data/
36. Google Cloud Platform
• App Engine: mobile and web apps
• Cloud SQL: MySQL on the cloud
• Cloud Storage: data storage
• BigQuery: data analysis
• Google Compute Engine: processing of large data
39. Current Trends
• Big data moves toward real usage
– From pilot to production
• More software solutions
– Infrastructure
– Analytics: statistical analysis, social graph analysis
• More unstructured data
– Facebook, Twitter, text, video, images
40. Google Flu
• A pattern emerges when all the flu-related search queries are added together.
• Google compared its query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening.
• By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.
http://www.google.org/flutrends/about/how.html
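At its core, the Google Flu idea rests on a correlation between two time series: weekly query volume and weekly reported cases. A toy illustration with made-up numbers (the figures below are invented for the example, not Google's data):

```python
import statistics as st

# Hypothetical weekly counts: flu-related searches vs. officially reported cases
searches = [120, 150, 400, 900, 850, 300, 140]
cases    = [ 10,  14,  45, 110, 100,  35,  12]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = st.mean(xs), st.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (len(xs) * st.pstdev(xs) * st.pstdev(ys))

r = pearson(searches, cases)
print(round(r, 2))   # close to 1.0: query volume tracks the flu season
```

A strong historical correlation like this is what lets query counts serve as a fast, cheap proxy for the slower official surveillance numbers.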
41. What Facebook Knows
http://www.facebook.com/data
Cameron Marlow calls himself Facebook's "in-house sociologist." He and his team can analyze essentially all the information the site gathers.
42. Study of Human Society
• Facebook, in collaboration with the University of Milan, conducted an experiment that involved
– the entire social network as of May 2011
– more than 10 percent of the world's population
• Analyzing the 69 billion friend connections among those 721 million people showed that
– four intermediary friends are usually enough to introduce anyone to a random stranger.
43. Why?
• Facebook can improve the user experience
– make useful predictions about users' behavior
– make better guesses about which ads you might be more or less open to at any given time
• Right before Valentine's Day this year, a blog post from the Data Science Team listed the songs most popular with people who had recently signaled on Facebook that they had entered or left a relationship.
44. How does Facebook handle Big Data?
• Facebook built its data storage system using open-source software called Hadoop.
– Hadoop spreads the data across many machines inside a data center.
– Facebook uses Hive, open-source software that acts as a translation service, making it possible to query vast Hadoop data stores using relatively simple code.
• Much of Facebook's data resides in one Hadoop store more than 100 petabytes (a million gigabytes) in size, says Sameet Agarwal, a director of engineering at Facebook who works on data infrastructure, and the quantity is growing exponentially. "Over the last few years we have more than doubled in size every year."
45. eBay
• eBay is using Hadoop technology and the HBase database, which supports real-time analysis of Hadoop data, to build a new search engine for its auction site.
– 97 million active buyers and sellers
– over 200 million items for sale in 50,000 categories
– The site handles close to 2 billion page views, 250 million search queries and tens of billions of database calls daily.
• The company has 9 petabytes of data stored on Hadoop and Teradata clusters, and the amount is growing quickly, he said.
• 100 eBay engineers are working on the Cassini project. The new engine is expected to respond to user queries with results that are context-based and more accurate than those provided by the current system.
Source: http://www.computerworld.com/article/2550078/data-center/hadoop-is-ready-for-the-enterprise--it-execs-say.html
46. • JPMorgan Chase still relies heavily on relational database systems for transaction processing.
• Hadoop technology is used for a growing number of purposes, including fraud detection, IT risk management and self service.
– With over 150 petabytes of data stored online, 30,000 databases and 3.5 billion log-ins to user accounts.
• Hadoop's ability to store vast volumes of unstructured data allows the company to collect and store Web logs, transaction data and social media data.
• The data is aggregated into a common platform for use in a range of customer-focused data mining and data analytics tools.
Source: http://www.computerworld.com/article/2550078/data-center/hadoop-is-ready-for-the-enterprise--it-execs-say.html
47. Premier
• Premier, the U.S. healthcare alliance network, has more than 2,700 member hospitals and health systems, 90,000 non-acute facilities and 400,000 physicians.
– a large database of clinical, financial, patient and supply-chain data
– has generated comprehensive and comparable clinical outcome measures, resource utilization reports and transaction-level cost data
• Big data is used to improve the healthcare processes at approximately 330 hospitals, saving an estimated 29,000 lives and reducing healthcare spending by nearly $7 billion.
Reference: IBM: Data Driven Healthcare Organizations Use Big Data Analytics for Big Gains; 2013. http://www03.ibm.com/industries/ca/en/healthcare/documents/Data_driven_healthcare_organizations_use_big_data_analytics_for_big_gains.pdf
48. Some Successes
• The Rizzoli Orthopedic Institute in Bologna, Italy
– is using advanced analytics to gain a more "granular understanding" of the clinical variations within families whereby individual patients display extreme differences in the severity of their symptoms.
• The insight is reported to have reduced annual hospitalizations by 30% and the number of imaging tests by 60%.
49. Social Media Analytics
• Social media analytics is the practice of gathering data from blogs and social media websites and analyzing that data to make business decisions. The most common use of social media analytics is to mine customer sentiment in order to support marketing and customer-service activities.
"What is social media analytics?" - Definition from WhatIs.com
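At its simplest, mining customer sentiment means scoring each post against lists of positive and negative words. A deliberately tiny sketch of that lexicon approach (the word lists and posts are invented for illustration; production systems use far richer models):

```python
POSITIVE = {"love", "great", "happy", "awesome"}
NEGATIVE = {"hate", "slow", "broken", "awful"}

def sentiment(post):
    # Score = count of positive words minus count of negative words
    words = post.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = [
    "I love this product, great support",
    "App is slow and broken",
    "Delivery was awesome",
]
scores = [sentiment(p) for p in posts]
print(scores)   # [2, -2, 1]
```

Aggregating such scores over millions of posts per day is exactly the kind of workload the big data infrastructure earlier in this deck is built for.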
52. Starting a Big Data Initiative
• Data infrastructure
• Big data tools
• Analytics software
• Visualization
(two approaches: top-down vs. bottom-up)
53. Data Product
• A data product provides actionable information without exposing the decision maker to the underlying data or analytics
– Movie recommendations
– Weather forecasts
– Stock market prediction
– Operational improvement
– Health diagnosis
– Targeted advertising
55. Bottom-up approach
• What data do we have?
• How can we collect and store it?
• What infrastructure and tools do we need to process this big data?
• What analytics methods can be applied?
• What insight can we gain from this data and analysis?
56. Top-down approach
• What business challenge can create value and impact for the organization?
• What data do we need?
• What tools and analytics approach should be used?
• What infrastructure is needed?
57. Some thoughts
• The bottom-up approach may be good when you do not know how to start.
• Pick some easy questions and start a pilot
– Learn the infrastructure technology, analytics technology and tools
– Use data you already have
• A top-down approach that focuses on business value is better, but challenging
– It is hard to ask a good question; management needs to identify the need
– You may have to ask many questions and pick the right one based on impact and value
58. Example: what is / is not a big data problem?
• I want to classify legal documents to make them easier to process
• I want to learn how our customers react to our new T-shirt
• I want to understand how our students use Facebook
63. Information Tsunami
• Rapid expansion of smartphone usage, social computing, mobile applications, gaming
• Rapid increases in network bandwidth and coverage
– WiFi, 4G
• Rapid move toward the Internet of Things (IoT)
– Sensors everywhere, multimedia information
64. Trend: big data infrastructure becomes even more powerful and easy to use
69. Trend: Big Data Infrastructure Goes to the Cloud
• Data is already on the cloud
– Virtual organizations
– Cloud-based SaaS services
• Big Data as a Service on the cloud
– Private cloud
– Public cloud
• IBM Bluemix, Amazon AWS (EMR) and many other big data services
70. Trend: big data is moving toward real usage
71. Trends
• Big data moves toward real usage
– From pilot to production
• More software solutions
– Infrastructure
– Analytics: statistical analysis, social graph analysis with machine learning
• More unstructured data
– Facebook, Twitter, text, video, images
73. Big Data Analytics
• A set of advanced technologies designed to work with large volumes of heterogeneous data.
• Explores the data to discover interrelationships and patterns, using sophisticated quantitative methods such as
– machine learning
– neural networks
– robotics algorithms
– computational mathematics
– artificial intelligence
74. Deep Learning
• Deep learning is a subcategory of machine learning that uses neural networks to improve things like speech recognition, computer vision, and natural language processing.
– Unsupervised learning of abstract concepts
75. Applying Deep Learning
• In 2011, Stanford computer science professor Andrew Ng founded Google's Google Brain project, which created a neural network trained with deep learning algorithms. It famously proved capable of recognizing high-level concepts, such as cats, after watching just YouTube videos, without ever having been told what a "cat" is.
• Facebook is using deep learning expertise to help create solutions that will better identify faces and objects in the 350 million photos and videos uploaded to Facebook each day.
• Voice recognition like Google Now and Apple's Siri now uses deep learning.
– According to Google researchers, the voice error rate in the new version of Android, after adding insights from deep learning, is 25% lower than in previous versions of the software.
Source: http://www.fastcolabs.com/3026423/why-google-is-investing-in-deep-learning
http://www.wired.com/2014/08/deep-learning-yann-lecun/
76. IBM Watson and Cognitive Technology
• Watson is a cognitive technology that processes information more like a human than a computer: by understanding natural language, generating hypotheses based on evidence, and learning as it goes. And learn it does.
• Watson "gets smarter" in three ways:
– being taught by its users
– learning from prior interactions
– being presented with new information
• This means organizations can more fully understand and use the data that surrounds them, and use that data to make better decisions.
77. Applying Watson in Healthcare
• WellPoint, Inc. is an Indianapolis-based health benefits company.
– approximately 37 million health plan members
– processes more than 550 million claims per year
• WellPoint is using IBM Watson™ to improve the quality and efficiency of healthcare decisions.
– WellPoint trained Watson with 25,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management. Natural language processing leverages unstructured data, such as text-based treatment requests.
• Benefits
– Helps UM nurses make faster UM decisions about treatment requests
– Could accelerate healthcare preapprovals, which can be critical when treatments are time-sensitive
– Includes unstructured data in the streamlined decision process
78. Challenges
• Developing big data applications is not simple
– New algorithms, new software development tools
• Proper policy on data security and ownership is needed
• Lack of data scientists
– A different role from software developer