Tech4Africa - Opportunities around Big Data - Steve Watt
The document discusses big data and techniques for gathering, storing, processing, and delivering large amounts of data at scale. It covers using Apache Nutch to crawl web data, storing data in Apache Hadoop's distributed file system and processing it using MapReduce. For low-latency queries, it recommends column stores like Apache HBase or Apache Cassandra. The document also discusses using machine learning on historical data to build models for real-time decision making, and challenges of processing unstructured data like prose.
No, combiner and reducer logic are not necessarily the same.
The combiner is an optional step that performs local aggregation of the intermediate key-value pairs generated by each mapper. Its goal is to reduce the amount of data transferred from the mappers to the reducers.
The reducer performs the final aggregation of the values associated with a particular key. It receives the intermediate outputs from all the mappers, grouped by key, and produces the final output.
So while the combiner and the reducer both perform aggregation, their scopes of operation differ: the combiner works locally on a single mapper's output to minimize data transfer, whereas the reducer operates globally on the outputs of all mappers to produce the final result. The logic of each needs to be written for its respective purpose.
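To illustrate why the two cannot always share logic, here is a minimal sketch (class names, and the assumption that the mapper emits "sum,count" strings per key, are hypothetical) of computing a per-key average: the combiner may only emit partial sums and counts, while the reducer produces the final average. Reusing this reducer as the combiner would give wrong results.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageJob {

  // Combiner: pre-aggregates locally, but only into partial "sum,count" pairs.
  public static class AverageCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {                          // each value is "sum,count" from the mapper
        String[] parts = v.toString().split(",");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      ctx.write(key, new Text(sum + "," + count));     // still partial: safe to combine again
    }
  }

  // Reducer: performs the global aggregation and emits the final average.
  public static class AverageReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      ctx.write(key, new DoubleWritable(sum / count)); // final answer: not safe to use as a combiner
    }
  }
}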
Platforms for data science in 3 sentences:
Data science now deals with vast amounts of data from many sources, and cloud platforms provide scalable and programmable infrastructure that is well-suited to handle large-scale data and computation. The cloud allows data scientists to move analysis to where the data is stored and take advantage of utilities like Amazon Web Services to optimize costly resources. AWS and cloud platforms can partner with data scientists to build customized solutions for their specific computational and data handling needs.
Hadoop is an open source framework that allows for the distributed processing of large data sets across clusters of computers. It uses a MapReduce programming model where the input data is distributed, mapped and transformed in parallel, and the results are reduced together. This process allows for massive amounts of data to be processed efficiently. Hadoop can handle both structured and unstructured data, uses commodity hardware, and provides reliability through data replication across nodes. It is well suited for large scale data analysis and mining.
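To make the map-then-reduce model concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce API (an illustration, not code from any of the documents described here): the mapper emits (word, 1) pairs in parallel over input splits, and the reducer sums the counts for each word. Because summing is associative and commutative, this particular reducer could also serve as a combiner.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: runs in parallel over input splits, emitting (word, 1) for every token.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token.toLowerCase());
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for a word and sums them into the final total.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }
}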
Slides for talk presented at Boulder Java User's Group on 9/10/2013, updated and improved for presentation at DOSUG, 3/4/2014
Code is available at https://github.com/jmctee/hadoopTools
Functional programming for optimization problems in Big Data - Paco Nathan
Enterprise Data Workflows with Cascading.
Silicon Valley Cloud Computing Meetup talk at Cloud Tech IV, 4/20 2013
http://www.meetup.com/cloudcomputing/events/111082032/
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks - Hortonworks
The document discusses the growing importance of Hadoop and big data processing. It notes that by 2015, organizations that build modern information management systems using technologies like Hadoop will outperform peers financially by 20%. It then outlines Hortonworks' vision, including developing Hadoop into an enterprise-ready platform that can support a wide range of workloads and use cases beyond just batch processing. Finally, it discusses Hortonworks' role in driving adoption of Hadoop through open source community contributions as well as commercial support.
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData... - Mahantesh Angadi
This document provides an introduction to big data and the installation of a single-node Apache Hadoop cluster. It defines key terms like big data, Hadoop, and MapReduce. It discusses traditional approaches to handling big data like storage area networks and their limitations. It then introduces Hadoop as an open-source framework for storing and processing vast amounts of data in a distributed fashion using the Hadoop Distributed File System (HDFS) and MapReduce programming model. The document outlines Hadoop's architecture and components, provides an example of how MapReduce works, and discusses advantages and limitations of the Hadoop framework.
This document provides an overview of 4 solutions for processing big data using Hadoop and compares them. Solution 1 involves using core Hadoop processing without data staging or movement. Solution 2 uses BI tools to analyze Hadoop data after a single CSV transformation. Solution 3 creates a data warehouse in Hadoop after a single transformation. Solution 4 implements a traditional data warehouse. The solutions are then compared based on benefits like cloud readiness, parallel processing, and investment required. The document also includes steps for installing a Hadoop cluster and running sample MapReduce jobs and Excel processing.
This was presented at NHN on Jan. 27, 2009.
It introduces Big Data, its storages, and its analyses.
Especially, it covers MapReduce debates and hybrid systems of RDBMS and MapReduce.
In addition, in terms of Schema-Free, various non-relational data storages are explained.
8 Douetteau - Dataiku - Data Tuesday Open Source, 26 Feb 2013 - Data Tuesday
Hal's company wants to build a big data platform but only has limited resources. He considers copying the approaches of larger competitors but thinks Dataiku may help him build a lab in six months. Dataiku claims to provide an open source core, the ability to connect different technologies, and deliver apps that provide ROI within a year through targeted newsletters and recommendations. Hal is interested in whether Dataiku can help his small team efficiently build the big data capabilities they need.
This document provides an outline and introduction for a lecture on MapReduce and Hadoop. It discusses the Hadoop architecture, including HDFS and YARN, and how they work together to provide distributed storage and processing of big data across clusters of machines. It also provides an overview of the MapReduce programming model and how data is processed through the map and reduce phases in Hadoop. It references several books on Hadoop, MapReduce, and big data fundamentals.
The document provides an overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets. It describes how Hadoop uses HDFS for distributed file storage across clusters and MapReduce for parallel processing of data. Key components of Hadoop include HDFS for storage, YARN for resource management, and MapReduce for distributed computing. The document also discusses some popular Hadoop distributions and real-world uses of Hadoop by companies.
Yahoo uses Apache Hadoop extensively to power many of its products and services. Hadoop allows Yahoo to gain insights from massive amounts of data, including user data from services like Flickr and Yahoo Mail. Yahoo has contributed over 70% of the code to the Apache Hadoop project to date. Hadoop is critical to Yahoo's business by enabling personalization, spam filtering, content optimization, and other data-driven features. Yahoo runs Hadoop on tens of thousands of servers storing over 100 petabytes of data. The company continues working to enhance Hadoop's scalability, flexibility, and performance to make it more suitable for enterprise use.
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter - Bill Graham
The document discusses Twitter's data analytics platform, including Hadoop and Vertica. It outlines Twitter's data flow, which ingests 400 million tweets daily into HDFS, then uses various tools like Crane, Oink, and Rasvelg to run jobs on the main Hadoop cluster before loading analytics into Vertica and MySQL for web tools and analysts. It also describes Twitter's heterogeneous technology stack and the various teams that use the analytics platform.
- Hadoop is a framework for managing and processing big data distributed across clusters of computers. It allows for parallel processing of large datasets.
- Big data comes from various sources like customer behavior, machine data from sensors, etc. It is used by companies to better understand customers and target ads.
- Hadoop uses a master-slave architecture with a NameNode master and DataNode slaves. Files are divided into blocks and replicated across DataNodes for reliability. The NameNode tracks where data blocks are stored.
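To show how a client interacts with this master-slave architecture, here is a small sketch using the HDFS Java API (the file path and output are illustrative assumptions): the client asks the NameNode for metadata, and each block location reveals which DataNodes hold a replica.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();           // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);               // talks to the NameNode
    Path file = new Path("/data/crawl/part-00000");     // hypothetical file on HDFS

    FileStatus status = fs.getFileStatus(file);
    // Each BlockLocation lists the DataNodes that hold a replica of one block of the file.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset " + block.getOffset()
          + " hosts " + String.join(",", block.getHosts()));
    }
  }
}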
Hadoop Basics - Apache Hadoop Big Data training by Design Pathshala - Design Pathshala
Learn Hadoop and Bigdata Analytics, Join Design Pathshala training programs on Big data and analytics.
This slide covers the basics of Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
A very high-level introduction to scaling out with Hadoop and NoSQL, combined with some experiences on my current project. I gave this presentation at the JFall 2009 conference in the Netherlands.
This document outlines the modules and topics covered in an Edureka course on Hadoop. The 10 modules cover understanding Big Data and Hadoop architecture, Hadoop cluster configuration, MapReduce framework, Pig, Hive, HBase, Hadoop 2.0 features, and Apache Oozie. Interactive questions are also included to test understanding of concepts like Hadoop core components, HDFS architecture, and MapReduce job execution.
Hadoop is an open source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally written by Google about its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors. Hadoop was developed by Doug Cutting and Michael J. Cafarella. And don't overlook the charming yellow elephant you see, which is named after the toy elephant of Doug's son!
The topics covered in the presentation are:
1. Big Data Learning Path
2. Big Data Introduction
3. Hadoop and its Eco-system
4. Hadoop Architecture
5. Next Steps on how to set up Hadoop
This document provides an introduction and overview of Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop uses MapReduce and HDFS to parallelize workloads and store data redundantly across nodes to solve issues around hardware failure and combining results. Key aspects covered include how HDFS distributes and replicates data, how MapReduce isolates processing into mapping and reducing functions to abstract communication, and how Hadoop moves computation to the data to improve performance.
This document summarizes a presentation about using Hadoop for large scale data analysis. It introduces Hadoop's architecture which uses a distributed file system and MapReduce programming model. It discusses how Hadoop can handle large amounts of data reliably across commodity hardware. Examples shown include word count and stock analysis algorithms in MapReduce. The document concludes by mentioning other Hadoop projects like HBase, Pig and Hive that extend its capabilities.
Data infrastructure at Facebook, with reference to the conference paper "Data warehousing and analytics infrastructure at Facebook"
Data warehouse
Hadoop - Hive - Scribe
This presentation provides an overview of Hadoop, including:
- A brief history of data and the rise of big data from various sources.
- An introduction to Hadoop as an open source framework used for distributed processing and storage of large datasets across clusters of computers.
- Descriptions of the key components of Hadoop - HDFS for storage, and MapReduce for processing - and how they work together in the Hadoop architecture.
- An explanation of how Hadoop can be installed and configured in standalone, pseudo-distributed and fully distributed modes.
- Examples of major companies that use Hadoop like Amazon, Facebook, Google and Yahoo to handle their large-scale data and analytics needs.
This document outlines an agenda for a Hadoop workshop covering Cisco's use of Hadoop. The agenda includes introductions, presentations on Hadoop concepts and Cisco's Hadoop architecture, and two hands-on exercises configuring Hadoop and using Hive and Impala for analytics. Key topics to be covered are Hadoop and big data concepts, Cisco's Webex Hadoop architecture using Cisco UCS, and how Hadoop addresses the challenges of large volumes of structured and unstructured data across global data centers.
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012 - Jonathan Seidman
A look at common patterns being applied to leverage Hadoop with traditional data management systems and the emerging landscape of tools which provide access and analysis of Hadoop data with existing systems such as data warehouses, relational databases, and business intelligence tools.
Neustar is a fast-growing provider of enterprise services in telecommunications, online advertising, Internet infrastructure, and advanced technology. Neustar has engaged Think Big Analytics to leverage Hadoop to expand their data analysis capacity. This session describes how Hadoop has expanded their data warehouse capacity and agility for data analysis, reduced costs, and enabled new data products. We look at the challenges and opportunities in capturing 100s of TBs of compact binary network data, ad hoc analysis, integration with a scale-out relational database, more agile data development, and building new products integrating multiple big data sets.
This document discusses how Apache Kafka and event streaming fit within a data mesh architecture. It provides an overview of the key principles of a data mesh, including domain-driven decentralization, treating data as a first-class product, a self-serve data platform, and federated governance. It then explains how Kafka's publish-subscribe event streaming model aligns well with these principles by allowing different domains to independently publish and consume streams of data. The document also describes how Kafka can be used to ingest existing data sources, process data in real-time, and replicate data across the mesh in a scalable and interoperable way.
1) Big data is growing exponentially and new frameworks like Hadoop are needed to analyze large, unstructured datasets.
2) Hadoop uses distributed computing and storage across commodity servers to provide scalable and cost-effective analytics. It leverages local disks on each node for temporary data to improve performance.
3) Virtualizing Hadoop simplifies operations, enables mixed workloads, and provides high availability through features like vMotion and HA. It also allows for elastic scaling of compute and storage resources.
This document discusses big data and how new data models are disrupting traditional approaches. It notes that while the new models are initially difficult to understand and threaten existing investments, they are capable of processing large volumes of data quickly. The document examines concepts like Hadoop, NoSQL, and how relational and non-relational approaches can work together in a hybrid environment. It concludes that trends point to more unified support of different data types and expanded capabilities in systems like real-time analytics and embedded search.
Cloud computing, big data, and mobile technologies are driving major changes in the IT world. Cloud computing provides scalable computing resources over the internet. Big data involves extremely large data sets that are analyzed to reveal business insights. Hadoop is an open-source software framework that allows distributed processing of big data across commodity hardware. It includes tools like HDFS for storage and MapReduce for distributed computing. The Hadoop ecosystem also includes additional tools for tasks like data integration, analytics, workflow management, and more. These emerging technologies are changing how businesses use and analyze data.
This document discusses the rapid growth of digital data and the challenges of analyzing large, unstructured datasets. It notes that in just one week in 2000, the Sloan Digital Sky Survey collected more data than had been collected in all of astronomy previously. Today, the Large Hadron Collider generates 40 terabytes per second and Twitter generates over 1 terabyte of tweets daily. By 2013, annual internet traffic was predicted to reach 667 exabytes. Hadoop provides a framework to analyze these vast and diverse datasets by distributing processing across commodity clusters close to where the data is stored.
This document discusses Hadoop and its relationship to Microsoft technologies. It provides an overview of what Big Data is, how Hadoop fits into the Windows and Azure environments, and how to program against Hadoop in Microsoft environments. It describes Hadoop capabilities like Extract-Load-Transform and distributed computing. It also discusses how HDFS works on Azure storage and support for Hadoop in .NET, JavaScript, HiveQL, and Polybase. The document aims to show Microsoft's vision of making Hadoop better on Windows and Azure by integrating with technologies like Active Directory, System Center, and SQL Server. It provides links to get started with Hadoop on-premises and on Windows Azure.
This document provides an overview of Spark, including:
- Spark was developed in 2009 at UC Berkeley and open sourced in 2010, with over 200 contributors.
- Spark Core is the general execution engine that other Spark functionality is built on, providing in-memory computing and supporting various programming languages.
- Spark Streaming allows data to be ingested from sources like Kafka and Flume and integrated with Spark for advanced analytics on streaming data.
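For a flavor of the in-memory programming model built on Spark Core, here is a tiny sketch using Spark's Java API (the application name, input path, and filter predicate are made up for illustration): transformations are lazy, and the filtered RDD can be cached in memory and reused across actions.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ErrorCounter {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ErrorCounter").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");      // hypothetical input
    // Keep only error lines and cache the result in memory for reuse.
    JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();

    System.out.println("error lines: " + errors.count());
    sc.stop();
  }
}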
Hadoop and Internet of Things presentation from Sinergija 2014 conference, held in Belgrade in October 2014. How the rising data resources change the business, and how the Big Data technologies combined with Internet of Things devices can help to improve the business and the everyday life. Hadoop is already the most significant technology for working with Big Data. Microsoft is playing a very important role in this field, with the Stinger initiative. The main goal is to bring the enterprise SQL at Hadoop scale.
Big Data Basic Concepts | Presented in 2014 - Kenneth Igiri
This document provides an overview of big data concepts and technologies. It discusses the 3 Vs, 4 Vs and 6 Vs frameworks used to describe big data. Key big data technologies mentioned include MapReduce, Hadoop, HDFS, YARN, and NoSQL databases like MongoDB, Cassandra, HBase and Dynamo. The Lambda architecture and CAP theorem concepts are also covered. Large internet companies like Google, Amazon, eBay are discussed as examples of organizations that have pioneered big data solutions to handle massive volumes of dynamic data at high velocity.
This document discusses tools for large scale data analysis. It begins by defining business value as anything that makes people more likely to give money or saves costs. It then discusses how data has outgrown local storage and requires scaling out to clusters and distributed systems. The document lists various systems that can be used for data ingestion, storage, querying, processing and output. It covers batch systems like Hadoop and real-time systems like Storm. It emphasizes that to generate business value, one needs to start analyzing big data from various sources like web logs, sensors and parse noise to find signals.
The document discusses The Apache Way Done Right and the success of Hadoop. It provides an overview of Apache Hadoop, including that it is a set of open source projects that transforms commodity hardware into a reliable system for storing and analyzing large amounts of data. It also discusses how Hadoop originated from the Nutch project and was adopted by early users like Yahoo, Facebook, and Twitter to handle big data challenges. Examples are given of how Yahoo used Hadoop for applications like the Webmap and personalized homepages.
The document discusses Hadoop and IoT. It provides an overview of big data and Hadoop, describing its core components like HDFS, MapReduce, and YARN. It also discusses how IoT generates large amounts of structured and unstructured data from devices. Hadoop is well suited to process and analyze the volume of data generated by IoT. The document also summarizes Hive, a data warehousing component of Hadoop that provides SQL-like queries to analyze IoT and other large datasets.
Cytoscape Untangles the Web: a first step towards Cytoscape Cyberinfrastructu... - Keiichiro Ono
Cytoscape is a standard desktop application for biological network analysis and visualization, but emerging problems include large network datasets that exceed desktop capabilities, demand for collaborative data sharing, and the need for self-publishing networks without web programming skills. CyNetShare is a first step towards a Cytoscape cyberinfrastructure that allows visualization of public network data files through an interactive web application using Cytoscape.js, sharing of visualizations via URL, and runs on both desktops and tablets.
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent - Hosted by Confluent
The document discusses the principles of a data mesh architecture using Apache Kafka for event streaming. It describes a data mesh as having four key principles: 1) domain-driven decentralization where each domain owns the data it creates, 2) treating data as a first-class product, 3) providing a self-serve data platform for easy access to real-time and historical data, and 4) establishing federated governance with global standards. Event streaming is presented as a good fit for data meshing due to its scalability, ability to handle real-time and historical data, and immutability. The document provides examples and recommendations for implementing each principle in a data mesh.
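As a rough illustration of the publish side of such an event stream (the topic name, broker address, and record contents are assumptions for illustration), a domain team might publish its events as a data product with the standard Kafka producer API:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderEventPublisher {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");                 // assumed broker address
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // The "orders" domain publishes its events so other domains can consume them independently.
      producer.send(new ProducerRecord<>("orders.events", "order-42", "{\"status\":\"SHIPPED\"}"));
    }
  }
}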
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production - Codemotion
What's important about a technology is what you can use it to do. I've looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples covered in this presentation.
Josh Patterson gave a presentation on Hadoop and how it has been used. He discussed his background working on Hadoop projects including for the Tennessee Valley Authority. He outlined what Hadoop is, how it works, and examples of use cases. This includes how Hadoop was used to store and analyze large amounts of smart grid sensor data for the openPDC project. He discussed integrating Hadoop with existing enterprise systems and tools for working with Hadoop like Pig and Hive.
Big Data Applications with Java discusses various big data technologies including Apache Hadoop, Apache Spark, Apache Kafka, and Apache Cassandra. It defines big data as huge volumes of data that cannot be processed using traditional approaches due to constraints on storage and processing time. The document then covers characteristics of big data like volume, velocity, variety, veracity, variability, and value. It provides overviews of Apache Hadoop and its ecosystem, including HDFS and MapReduce. Apache Spark is introduced as an enhancement to MapReduce that processes data faster in memory. Apache Kafka and Cassandra are also summarized as distributed streaming and database platforms respectively. The document concludes by comparing Hadoop and Spark, outlining their relative performance, costs, and processing capabilities.
Paris Spark Meetup (Feb 2015), ccarbone: Spark Streaming vs Storm / MLlib / Ne... - Cedric Carbone
A presentation of the Spark technology and examples of new business use cases that can be addressed with real-time big data, by Cédric Carbone:
- Spark vs Hadoop MapReduce (& Hadoop v2 vs Hadoop v1)
- Spark Streaming vs Storm
- Machine learning with Spark
- Business use case: NextProductToBuy
Cloud computing, big data, and mobile are three major trends that will change the world. Cloud computing provides scalable and elastic IT resources as services over the internet. Big data involves large amounts of both structured and unstructured data that can generate business insights when analyzed. The hadoop ecosystem, including components like HDFS, mapreduce, pig, and hive, provides an architecture for distributed storage and processing of big data across commodity hardware.
5. Web 2.0 Era Topic Map
[Slide diagram: a topic map of the Web 2.0 era linking what we produce and how we process it: the data explosion and inexpensive storage, the LAMP stack, social publishing platforms, situational applications, Web 2.0 mashups, and enterprise SOA.]
8. The data just keeps growing…
1024 gigabytes = 1 terabyte
1024 terabytes = 1 petabyte
1024 petabytes = 1 exabyte
1 petabyte = 13.3 years of HD video
20 petabytes = amount of data processed by Google daily
5 exabytes = all words ever spoken by humanity
9. [Slide diagram: the real-time data opportunity. Mobile brings an app economy for devices (an app for this, an app for that; set-top boxes, tablets) and the sensor web brings an instrumented and monitored world with multiple sensors in your pocket, all producing real-time data. Alongside the fractured web of Facebook, Twitter, and LinkedIn sits a service economy (a service for this, a service for that: Google, Netflix, the New York Times, eBay, Pandora, PayPal), generating a Web 2.0 data exhaust of historical and real-time data. The stack builds from Web 1.0 (connecting machines, infrastructure) through the web as a platform (API foundation) to Web 2.0 (connecting people).]
23. Storing, Reading and Processing - Apache Hadoop
Cluster technology with a single master that scales out across multiple slaves
It consists of two runtimes:
The Hadoop Distributed File System (HDFS)
Map/Reduce
As data is copied onto HDFS, it is split into blocks and replicated to other machines to provide redundancy
A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop master, which in turn distributes the job to each slave in the cluster (a minimal sketch follows below)
Jobs run against the data on the local disks of the machines they are sent to, ensuring data locality
Node (slave) failures are handled automatically by Hadoop, which may execute or re-execute a job on any node in the cluster
Want to know more?
“Hadoop – The Definitive Guide (2nd Edition)”
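To make the Map/Reduce model concrete, here is a minimal word-count sketch of such a self-contained job, written against Hadoop's Java API. It is not from the talk: the class name, the input/output paths and the word-count task itself are illustrative assumptions.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: runs on the slave holding the data block and emits (word, 1) pairs
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }
  // Reducer: receives all counts for a given word and sums them
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");  // self-contained job submitted to the master
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS (assumption)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS (assumption)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}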
24. Delivering Data @ Scale
• Structured Data
• Low Latency & Random Access
• Column Stores (Apache HBase or Apache Cassandra)
• faster seeks
• better compression
• simpler scale out
• De-normalized – Data is written as it is intended to be queried (see the sketch after this slide)
Want to know more?
“HBase – The Definitive Guide” & “Cassandra High Performance”
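To show what "written as it is intended to be queried" looks like in practice, here is a rough sketch (not from the talk) against the HBase Java client API of that era. The table name, column family, row-key scheme and figures are illustrative assumptions, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class InvestmentStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "investments");  // hypothetical, pre-created table

    // De-normalized write: the row key encodes the query (investments by zip code),
    // so the data is stored exactly as it is intended to be read back.
    Put put = new Put(Bytes.toBytes("94103|acme|2011"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("company"), Bytes.toBytes("Acme"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes(5000000L));
    table.put(put);

    // Low-latency random access by row key
    Result row = table.get(new Get(Bytes.toBytes("94103|acme|2011")));
    long amount = Bytes.toLong(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("amount")));
    System.out.println("Amount invested: " + amount);

    table.close();
  }
}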
25. Storing, Processing & Delivering: Hadoop + NoSQL
[Diagram] Gather: web data is crawled with Nutch, log files are collected with the Flume connector, and relational data is pulled from MySQL with the SQOOP connector (over JDBC). Copy: the gathered data is loaded onto HDFS. Read/Transform: Apache Hadoop jobs clean and filter the data, then transform and enrich it (often across multiple Hadoop jobs). Serve: the results are pushed through a NoSQL connector/API into a NoSQL repository, which the application queries with low latency.
26. Some things to keep in mind…
Photo: Kanaka Menehune (Flickr)
27. Some things to keep in mind…
• Processing arbitrary types of data (unstructured, semi-structured, structured) requires normalizing data with many different kinds of readers
Hadoop is really great at this!
• However, readers won't really help you process truly unstructured data such as prose. For that you're going to have to get handy with Natural Language Processing. But this is really hard.
Consider using parsing services & APIs like Open Calais
Want to know more?
“Programming Pig” (O’REILLY)
29. Statistical real-time decision making
Capture historical information
Use machine learning to build decision-making models (such as classification, clustering & recommendation)
Mesh real-time events (such as sensor data) against the models to make automated decisions (a rough sketch follows below)
Want to know more?
“Mahout in Action”
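A rough sketch of that pattern (not from the talk), using the Taste collaborative-filtering classes that ship with Apache Mahout and are covered in "Mahout in Action". The preference file, neighborhood size and user ID are illustrative assumptions.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Captured historical information: userID,itemID,preference triples (file name is an assumption)
    DataModel model = new FileDataModel(new File("history.csv"));

    // Build the decision-making model from the historical data
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Mesh a real-time event (user 42 shows up) against the model to make an automated decision
    List<RecommendedItem> items = recommender.recommend(42L, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " scored " + item.getValue());
    }
  }
}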
33. Using Apache Nutch
Identify Optimal Seed URLs for a Seed List & Crawl to a depth of 2
For example:
http://www.crunchbase.com/companies?c=a&q=private_held
http://www.crunchbase.com/companies?c=b&q=private_held
http://www.crunchbase.com/companies?c=c&q=private_held
http://www.crunchbase.com/companies?c=d&q=private_held
...
Crawl data is stored in sequence files in the segments dir on HDFS (a reader sketch follows below)
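To inspect what a crawl produced, something like the following sketch (not from the talk) reads a Hadoop sequence file back out of the segments dir. The path argument is an illustrative assumption, and Nutch also ships its own segment-reading tools.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SegmentPeek {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);  // e.g. a part file under the crawl segments dir (assumption)
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // Keys and values are Writables; instantiate whatever types the file declares
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}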
37. Apache Pig Script to Join on City to get Zip Code and Write the results to Vertica
ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS (State:chararray, City:chararray, ZipCode:int);
CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
CrunchBaseZip = JOIN CrunchBase BY (City,State), ZipCodes BY (City,State);
STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}' USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
40. Total Investments By Zip Code for all Sectors
$1.2 Billion in Boston
$7.3 Billion in San Francisco
$2.9 Billion in Mountain View
$1.7 Billion in Austin
41. Total Investments By Zip Code for Consumer Web
$600 Million in Seattle
$1.2 Billion in Chicago
$1.7 Billion in San Francisco
42. Total Investments By Zip Code for BioTech
$1.3 Billion in Cambridge
$528 Million in Dallas
$1.1 Billion in San Diego
43. Questions?
Steve Watt swatt@hp.com
@wattsteve
stevewatt.blogspot.com
Editor's Notes
As hardware became increasingly commoditized, the margin & differentiation moved to software; as software becomes increasingly commoditized, the margin & differentiation is moving to data. 2000 - Cloud is an IT sourcing alternative (virtualization extends into cloud). Explosion of unstructured data. Mobile. “Let’s create a context in which to think….” Focused on 3 major tipping points in the evolution of the technology. Mention that this is a very web-centric view, contrasted to Barry Devlin’s Enterprise view. Assumes networking falls under hardware & cloud sits at the intersection of software and data. Why should you care? Tipping Point 1: Situational Applications. Tipping Point 2: Big Data. Tipping Point 3: Reasoning.
Web 2.0 (information explosion, now many channels - turning consumers into producers (Shirky); tipping point: web standards allow rapid application development, advent of situational applications, folksonomies, social). SOA (functionality exposed through open interfaces and open standards; great strides in modularity and re-use whilst reducing complexities around system integration; still need to be a developer to create applications using these service interfaces (WSDL, SOAP, way too complex!) - enter mashups…). Mashups (place a façade on the service and you have the final step in the evolution of services and service-based applications; now anyone can build applications, i.e. non-programmers - we’ve taken the entire SOA library and exposed it to non-programmers. What do I mean? Check out this YouTunes app…). First example where we saw arbitrary data/content re-purposed in ways the original authors never intended - e.g. Craigslist/Gumtree homes for sale scraped and placed on a Google map, mashed up with crime statistics. The whole is greater than the sum of its parts -> new kinds of information! BUT there are limitations around how much arbitrary data can be scraped and turned into info: usually no pre-processing, and just what can be rendered on a single page. Demo
http://www.housingmaps.com/
“Every 2 days we create as much data as we did from the dawn of humanity until 2003” – We’ve hit the Petabyte & Exabyte age. What does that mean? Let’s look (next slide)
Mention Enterprise growth over time, Mobile/Sensor data, Web 2.0 data exhaust, Social Networks. Advances in Analytics – keep your data around for deeper business insights and to avoid Enterprise Amnesia.
How about we summarize a few of the key trends in the Web as we know it today…. This diagram shows some of the main trends of what Web 3.0 is about… Netflix accounts for 29.7% of US traffic. Mention Web 2.0 Summit Points of Control. Having more data leads to better context, which leads to deeper understanding/insight or new discoveries. Refer to Reid Hoffman’s views on what Web 3.0 is.
Pre-processed though, not flexible, you can’t ask specific questions that have not been pre-processed
Mention folksonomies in Web 2.0 with searching Delicious Bookmarks. Mention Chilean Earthquake Crisis Video using Twitter to do Crisis Mapping.
Talk about Visualizations and InfoGraphics – manual and a lot of work
They are only part of the solution & don’t allow you to ask your own questions
This is the real promise of Big Data
These are not all the problems around Big Data. These are the bigger problems around deriving new information out of web data. There are other issues as well, like inconsistency, skew, etc.
Give a Nutch example
Specifically call out the color coding reasoning for Map/Reduce and HDFS as a single distributed service
Give examples of how one might use Open Calais or Entity Extraction libraries