Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridge: Privacy Preservation in the Context of Big Data Processing
1. Privacy Preservation in the Context of Big Data Processing
Kavé Salamatian, Universite de Savoie
Eiko Yoneki, University of Cambridge
PART I – Eiko Yoneki: Large-Scale Data Processing
PART II – Kavé Salamatian: Big Data and Privacy
2. PART I: Large-Scale Data Processing
Eiko Yoneki
eiko.yoneki@cl.cam.ac.uk
http://www.cl.cam.ac.uk/~ey204
Systems Research Group, University of Cambridge Computer Laboratory

Outline
What and why large data?
Technologies
Analytics
Applications
Privacy (covered by Kavé in Part II)
3. Source of Big Data
Facebook: 40+ billion photos (100 PB); 6 billion messages per day (5-10 TB); 900 million users (1 trillion connections?)
Common Crawl: covers 5 million web pages, 50 TB of data
Twitter Firehose: 350 million tweets/day x 2-3 KB/tweet ~ 1 TB/day
CERN: 15 PB/year, stored in an RDB
Google: 20 PB a day (2008)
eBay: 9 PB of user data, plus >50 TB/day
US census data: detailed demographic data
Amazon Web Services: S3 holds 450 billion objects, peak of 290K requests/sec
JPMorganChase: 150 PB on 50K+ servers, with 15K apps running
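As a quick sanity check on the Twitter Firehose figure above, the back-of-the-envelope arithmetic can be reproduced in a few lines of Python (illustrative only, not from the slides):

```python
# Rough check of the "350 million tweets/day x 2-3 KB/tweet ~ 1 TB/day" estimate
tweets_per_day = 350_000_000
bytes_per_tweet = 2.5 * 1024                 # midpoint of the 2-3 KB range
tb_per_day = tweets_per_day * bytes_per_tweet / 1024**4
print(f"~{tb_per_day:.2f} TB/day")           # ~0.81 TB/day, i.e. on the order of 1 TB/day
```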
3Vs of Big Data
Volume: terabytes or even petabytes scale
Velocity: time-sensitive, streaming data that must be processed in time to maximise its value
Variety: beyond structured data (e.g. text, audio, video, etc.)
4. Significant Financial Value
SOURCE: McKinsey Global Institute analysis

Gnip
Grand central station for the Social Web stream
Aggregates several TB of new social data daily
5. Climate Corporation
14 TB of historical weather data
30 technical staff, including 12 PhDs
10,000 sales agents

FICO
50+ years of experience doing credit ratings
Transitioning to predictive analytics
6. Why Big Data?
Hardware and software technologies can
manage ocean of data
Increase of Storage Capacity
Increase of Processing Capacity
Availability of Data
11
Data Storage
12
SOURCE: McKinsey Global Institute analysis
6
7. Computation Capability
13
SOURCE: McKinsey Global Institute analysis
Data from Social networks
14
SOURCE: McKinsey Global Institute analysis
7
8. Issues to Process Big Data
Additional Issues
Usage
Quality
Context
Streaming
Scalability
Data Modalities
Data Operators 15
Outline
What and Why large data?
Technologies
Analytics
Applications
Privacy
16
8
9. Techniques for Analysis
Applying these techniques, larger and more diverse datasets can be used to generate more numerous and insightful results than smaller, less diverse ones:
Classification, Cluster analysis, Crowd sourcing, Data fusion/integration, Data mining, Ensemble learning, Genetic algorithms, Machine learning, NLP, Neural networks, Network analysis, Optimisation,
Pattern recognition, Predictive modelling, Regression, Sentiment analysis, Signal processing, Spatial analysis, Statistics, Supervised learning, Simulation, Time series analysis, Unsupervised learning, Visualisation
17
Technologies for Big Data
Distributed systems
Cloud (e.g. Amazon EC2 - Infrastructure as a
service)
Storage
Distributed storage (e.g. Amazon S3)
Programming model
Distributed processing (e.g. MapReduce)
Data model/indexing
High-performance schema-free database (e.g.
NoSQL DB)
Operations on big data
Analytics – Realtime Analytics 18
9
10. Distributed Infrastructure
Computing + Storage transparently
Cloud computing, Web 2.0
Scalability and fault tolerance
Distributed servers
Amazon EC2, Google App Engine, Elastic, Azure
E.g. EC2:
Pricing? Reserved, on-demand, spot, geography
System? OS, customisations
Sizing? RAM/CPU based on tiered model
Storage? Quantity, type
Networking / security
Distributed storage
Amazon S3
Hadoop Distributed File System (HDFS)
Google File System (GFS) - BigTable
Hbase
19
Distributed Storage
E.g. Hadoop Distributed File System (HDFS)
High performance distributed file system
Asynchronous replication
Data divided into 64/128 MB blocks (replicated 3 times)
NameNode holds file system metadata
Files are broken up and spread over DataNodes
20
10
11. Distributed Infrastructure
The stack, top to bottom:
Access: Pig, Hive, DryadLINQ, Java, …
Processing: MapReduce (Hadoop, Google MR), Dryad; streaming: HaLoop, …
Semi-structured storage: HDFS, GFS, HBase, BigTable, Cassandra, Dynamo
Management/coordination: Zookeeper, Chubby
Infrastructure: MS Azure, Amazon WS, Google AppEngine, Rackspace, …
21
Challenges
Big data has to scale by building on distribution, combining a theoretically unlimited number of machines into one single distributed storage
Distribute and shard parts over many machines
Still allow fast traversal and reads by keeping related data together
Data stores including NoSQL
Scale out instead of scaling up
Avoid naïve hashing for sharding: placement should not depend on the number of nodes, otherwise adding/removing nodes is difficult (see the consistent-hashing sketch below)
Trade-offs: data locality, consistency, availability, read/write/search speed, latency, etc.
Analytics requires both real-time and after-the-fact analysis
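To make the sharding point above concrete, here is a minimal consistent-hashing sketch in Python (illustrative code, not from the slides): keys are placed on a hash ring, so adding or removing a node only remaps the keys between that node and its neighbour, instead of rehashing everything the way a naïve hash(key) % num_nodes scheme would.

import bisect
import hashlib

def _hash(value: str) -> int:
    # Map any string to a point on the ring [0, 2^32).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas          # virtual nodes per physical node
        self._ring = []                   # sorted list of (point, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(p, n) for (p, n) in self._ring if n != node]

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's point to the first virtual node.
        point = _hash(key)
        index = bisect.bisect(self._ring, (point,)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # stays on the same node while the ring is unchanged
ring.add_node("node-d")           # only a fraction of keys move to node-d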
11
12. Hadoop
Founded in 2004 by a Yahoo! employee
Spun into an open-source Apache project
General-purpose framework for Big Data
MapReduce implementation
Support tools (e.g. distributed storage, concurrency)
Used by everybody… (Yahoo!, Facebook, Amazon, MS, Apple)
23
Amazon Web Services
Launched 2006
Largest, most popular cloud computing platform
Elastic Compute Cloud (EC2)
Rent elastic compute units by the hour: one unit ≈ one 1 GHz machine
Can choose Linux, FreeBSD, Solaris, and Windows
Virtual private servers running on Xen
Pricing: US$0.02 – 2.50 per hour
Simple Storage Service (S3)
Index by bucket and key
Accessible via HTTP, SOAP and BitTorrent
Over 1 trillion objects now uploaded
Pricing: US$0.05-0.10 per GB per month
Stream Processing Service (S4)
Other AWS:
Elastic MapReduce (Hadoop on EC2 with S3)
SQL Database
Content delivery networks, caching
24
12
13. Distributed Processing
Non standard programming models
Use of cluster computing
No traditional parallel programming models (e.g.
MPI)
New model: e.g. MapReduce
25
MapReduce
The target problem needs to be parallelisable
It is split into a set of smaller tasks (map)
Each small piece of code is executed in parallel
Finally the set of results from the map operations is synthesised into a result for the original problem (reduce)
26
13
14. Task Coordination
Typical architecture utilises a single master
and multiple (unreliable) workers
Master holds state of current configuration,
detects node failure, and schedules work
based on multiple heuristics. Also coordinates
resources between multiple jobs
Workers perform work! Both mapping and
reducing, possibly at the same time
27
Example: Word Count
28
14
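The word-count slide is only an image in the original deck. As a stand-in, here is a minimal single-process sketch of the map, shuffle and reduce steps in Python (an assumed example, not the deck's code); in a real Hadoop job the map and reduce functions have the same shape but run on distributed splits of the input.

from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    # map: emit (word, 1) for every word in one input split
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    # group values by key, as the framework does between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word: str, counts):
    # reduce: synthesise the per-key values into one result
    return word, sum(counts)

documents = ["big data is big", "data about data"]
pairs = chain.from_iterable(map_phase(doc) for doc in documents)
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(result)   # {'big': 2, 'data': 3, 'is': 1, 'about': 1}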
16. CIEL: Dynamic Task Graphs
MapReduce prescribes a task graph that can
be adapted to many problems
Later execution engines such as Dryad allow
more flexibility, for example to combine the
results of multiple separate computations
CIEL takes this a step further by allowing the
task graph to be specified at run time – for
example:
while (!converged) spawn(tasks);
Tutorial: http://www.cambridgeplus.net/tutorials/CIEL-DCN/
31
Dynamic Task Graph
Data-dependent control flow
CIEL: Execution engine for dynamic task
graphs
(D. Murray et al. CIEL: a universal execution engine for distributed
data-flow computing, NSDI 2011)
32
16
17. Data Model/Indexing
Support large data
Fast and flexible
Operate on distributed infrastructure
Is SQL Database sufficient?
33
Traditional SQL Databases
Most interesting queries require computing joins
34
17
18. NoSQL (Schema Free) Database
NoSQL database
Support large data
Operate on distributed infrastructure (e.g. Hadoop)
Based on key-value pairs (no predefined schema)
Fast and flexible
Pros: Scalable and fast
Cons: Fewer consistency/concurrency
guarantees and weaker queries support
Implementations
MongoDB
CouchDB
Cassandra
Redis
BigTable
Hibase
Hypertable
… 35
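To make the key-value, schema-free point concrete, here is a small illustrative sketch in plain Python (not tied to any particular NoSQL product): two records in the same "collection" can carry different fields, and reads go through the key rather than through joins.

# A toy document store: collection -> key -> schema-free document.
store = {"users": {}}

store["users"]["u:1"] = {"name": "Alice", "city": "Cambridge"}
store["users"]["u:2"] = {"name": "Bob", "follows": ["u:1"], "premium": True}

# Reads are key lookups, not joins; missing fields are simply absent.
doc = store["users"]["u:2"]
print(doc.get("city", "<unknown>"))   # '<unknown>' - no fixed schema enforced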
Data Assumptions
36
18
19. NoSQL Database
Maintain unique keys per row
Complicated multi-valued columns for rich
query
37
Outline
What and Why large data?
Technologies
Analytics Data Analytics
Applications
Privacy
38
19
20. Do we need new Algorithms?
Can’t always store all data
Online/streaming algorithms
Memory vs. disk becomes critical
Algorithms with limited passes
O(N²) algorithms are impossible
Approximate algorithms
Data has different relation to various other
data
Algorithms for high-dimensional data
39
Complex Issues with Big Data
Because of large amount of data, statistical analysis
might produce meaningless result
Example:
40
20
21. Easy Cases
Sorting
Google 1 trillion items (1PB) sorted in 6 Hours
Searching
Hashing and distributed search
Random split of data to feed M/R operation
But not all algorithms are parallelisable
41
More Complex Case: Stream Data
Have we seen x before?
Rolling average of previous K items
Sliding window of traffic volume
Hot list – most frequent items seen so far
Probabilistically start tracking a new item
Querying data streams
Continuous Query
42
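As a hedged illustration of the stream queries listed above, the sketch below keeps a rolling average over the previous K items and an approximate "have we seen x before?" set, both in bounded memory (example code, not from the slides).

from collections import deque
import hashlib

class RollingAverage:
    def __init__(self, k: int):
        self.window = deque(maxlen=k)   # sliding window of the last k items
        self.total = 0.0

    def add(self, value: float) -> float:
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]          # evicted by the append below
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)

class ApproxSeen:
    # Tiny Bloom-filter-style membership test: no false negatives,
    # a small chance of false positives, constant memory.
    def __init__(self, bits: int = 1 << 20, hashes: int = 3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def seen_before(self, item: str) -> bool:
        hit = all(self.array[p // 8] >> (p % 8) & 1 for p in self._positions(item))
        for p in self._positions(item):            # record the item either way
            self.array[p // 8] |= 1 << (p % 8)
        return hit

avg = RollingAverage(k=3)
seen = ApproxSeen()
for x in [10, 20, 30, 40]:
    print(avg.add(x), seen.seen_before(str(x)))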
21
22. Typical Operation with Big Data
Smart sampling of data: reducing the original data while maintaining its statistical properties (see the reservoir-sampling sketch below)
Finding similar items: efficient multidimensional indexing
Incremental updating of models: supports streaming
Distributed linear algebra: dealing with large sparse matrices
Plus the usual data mining, machine learning and statistics
Supervised (e.g. classification, regression)
Unsupervised (e.g. clustering, …)
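One standard "smart sampling" technique is reservoir sampling: keep a fixed-size uniform sample of a stream of unknown length in a single pass. A minimal sketch (assumed example, not from the slides):

import random

def reservoir_sample(stream, k: int):
    # Return k items chosen uniformly at random from the stream, one pass.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)        # each item is kept with prob k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))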
How about Graph Data
Examples: a bipartite graph of phrases appearing in documents, an airline graph, social networks, gene expression data, protein interactions [genomebiology.com]
22
23. Different Algorithms for Graph
Different algorithms perform differently:
BFS, DFS, CC, SCC, SSSP, ASP, MIS, A*, Community detection, Centrality, Diameter, PageRank, …
[Figure: running time in seconds when processing a graph of 50 million English web pages on 16 servers (from Najork et al., WSDM 2012)]
45
How to Process Big Graph Data?
Data-Parallel (e.g. MapReduce)
Large datasets are partitioned across machines and replicated
No efficient random access to data
Graph algorithms are not fully parallelisable
Parallel DB
Tabular format providing ACID properties
Allow data to be partitioned and processed in parallel
Graph does not map well to tabular format
Modern NoSQL
Allow flexible structure (e.g. graph)
Trinity, Neo4J
In-memory graph stores improve latency (e.g. Redis, Scalable Hyperlink Store (SHS)) but are expensive for petabyte-scale workloads
46
23
24. Big Graph Data Processing
MapReduce is not suited for graph processing
Many iterations are needed for parallel graph processing
Intermediate results at every MapReduce iteration harm performance
Graph-specific data parallel models
Multiple iterations are needed to explore the entire graph
Iterative algorithms (e.g. CC, SSSP, BFS in the graph "tool box") are common in machine learning and graph analysis
47
Data Parallel with Graph is Hard
Designing efficient parallel algorithms:
Avoid deadlocks on access to data
Prevent parallel memory bottlenecks
Requires efficient algorithms for data parallelism
High-level abstraction helps (MapReduce), but processing millions of data items with interdependent computation is difficult to deploy
Data dependency and iterative operation are key
CIEL
GraphLab
Graph-specific data parallel
Use of the Bulk Synchronous Parallel model
BSP lets peers communicate only the necessary data while preserving data locality
48
24
25. Bulk Synchronous Parallel Model
Computation is sequence of iterations
Each iteration is called a super-step
Computation at each vertex in parallel
Google Pregel: Vertex-based graph processing;
defining a model based on computing locally at
each vertex and communicating via message
passing over vertex’s available edges
BSP-based: Giraph, HAMA, GoldenORB
49
BSP Example
Finding the largest value in a strongly connected graph:
alternate super-steps of local computation at each vertex with message communication along edges, repeating until no vertex changes (see the sketch below)
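A minimal vertex-centric sketch of the "largest value" example, in the spirit of Pregel/BSP (illustrative code, not the deck's): in each super-step every vertex reads its incoming messages, keeps the maximum it has seen, and messages its neighbours when its value changes; the computation halts when no vertex changes.

def bsp_max(values, edges):
    # values: vertex -> initial value; edges: vertex -> list of out-neighbours.
    # Super-step 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in values}
    for v, nbrs in edges.items():
        for nbr in nbrs:
            inbox[nbr].append(values[v])
    while True:
        changed = False
        outbox = {v: [] for v in values}
        for v in values:                      # local computation at each vertex
            best = max([values[v]] + inbox[v])
            if best != values[v]:
                values[v], changed = best, True
                for nbr in edges[v]:          # only changed vertices keep messaging
                    outbox[nbr].append(values[v])
        if not changed:
            return values
        inbox = outbox                        # barrier: next super-step

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}       # strongly connected triangle
print(bsp_max({"a": 3, "b": 6, "c": 1}, graph))    # every vertex converges to 6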
25
26. Graph, Matrix, Machine Learning
[Diagram: BSP, iterative MapReduce and plain MR engines mapped against matrix computation, graph processing and machine learning workloads]
51
Further Issues on Graph Processing
Lot of work on computation
Little attention to storage
Store LARGE amount of graph structure data (edge lists)
Efficiently move it to computation (algorithm)
Potential solutions:
Cost effective but efficient storage
Move to SSDs from RAM
Reduce latency
Blocking to improve spatial locality
Runtime prefetching
Reduce storage requirements
Compressed adjacency lists (see the sketch below)
52
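"Compressed adjacency lists" typically means sorting each neighbour list, storing the gaps between consecutive IDs, and packing those gaps with a variable-length code. A minimal delta + varint sketch (assumed example, not from the slides):

def encode_varint(n: int) -> bytes:
    # 7 bits per byte; the high bit marks "more bytes follow".
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)

def compress_neighbours(neighbours):
    # Sort, then store gaps between consecutive IDs instead of raw IDs.
    prev, out = 0, bytearray()
    for node_id in sorted(neighbours):
        out += encode_varint(node_id - prev)
        prev = node_id
    return bytes(out)

def decompress_neighbours(data: bytes):
    value, shift, prev, result = 0, 0, 0, []
    for byte in data:
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:                 # last byte of this varint
            prev += value
            result.append(prev)
            value, shift = 0, 0
    return result

edges = [1_000_004, 1_000_001, 1_000_900]
blob = compress_neighbours(edges)
assert decompress_neighbours(blob) == sorted(edges)
print(len(blob), "bytes instead of", 3 * 8)   # gaps fit in far fewer bytes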
26
27. Outline
What and Why large data?
Technologies
Analytics
Applications
Privacy
53
Applications
Digital marketing Optimisation (e.g. web
analytics)
Data exploration and discovery (e.g. data
science, new markets)
Fraud detection and prevention (e.g. site
integrity)
Social network and relationship analysis (e.g.
influence marketing)
Machine generated data analysis (e.g. remote
sensing)
Data retention (i.e. data archiving)
54
27
28. Recommendation
55
Online Advertisement
50GB of uncompressed log files
50-100M clicks
4-6M unique users
7000 unique pages with more than 100 hits 56
28
29. Network Monitoring
57
Data Statistics
Leskovec (WWW 2007)
Log data 150GB/day (compressed)
4.5TB of one month data
Activity over June 2006 (30 days)
245 million users logged in
180 million users engaged in conversation
17 million new accounts activated
More than 30 billion conversations
More than 255 billion exchanged messages
Who talks to whom, and for how long (conversation duration)
58
29
30. Geography and Communication
59
Visualisation: News Feed
http://newsfeed.ijs.si/visual_demo/
Animation/interactivity often necessary
60
30
31. Visualisation: GraphViz
61
Outline
What and Why large data?
Technologies
Analytics
Applications
Privacy
62
31
32. Privacy
Technology is neither good nor bad, it is
neutral
Big data is often generated by people
Obtaining consent is often impossible
Anonymisation is very hard…
63
You only need 33 bits
Birth date, postcode, gender
Unique for 87% of US population (Sweeney 1997)
Preference in movies
99% of 500K users with 8 ratings (Narayanan 2007)
Web browser
94% of 500K users (Eckersley)
Writing style
20% accurate out of 100K users (Narayanan 2012)
How to prevent this? → Differential Privacy
64
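The "33 bits" figure comes from log2 of the world population being roughly 33, so any attribute combination carrying about that much information can single a person out. A back-of-the-envelope check for the birth date / postcode / gender example, with rough assumed cardinalities (my illustrative guesses, not Sweeney's exact figures):

from math import log2

world_population = 7e9
print(log2(world_population))          # ≈ 32.7 bits needed to single anyone out

# Rough cardinalities for the Sweeney-style attributes (illustrative guesses):
attributes = {"birth date": 365 * 90, "ZIP/postcode": 40_000, "gender": 2}
bits = sum(log2(n) for n in attributes.values())
print(bits)                            # ≈ 31.3 bits - close to uniquely identifying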
32
33. Take Away Messages
Big Data may seem like a buzzword, but it is everywhere
The increasing capability of hardware and software will make big data accessible
The potential for great data analytics
Can we do big data processing?
Yes, but more efficient processing will be required…
Inter-disciplinary approach is necessary
Distributed systems
Networking
Database
Algorithms
Machine Learning
Privacy ! PART II
65
Acknowledgement
Joe Bonneau (University of Cambridge)
Marko Grobelnik (Jozef Stefan Institute)
Thank You!
66
33
34. PART II: Big Data and Privacy
Kavé Salamatian
Universite de Savoie
Big Data and privacy
Data in relational database
Linkage attack with auxiliary information
e.g. (gender, zip, birthday)
Matrix data de-anonymization
Netflix dataset [NS08]
Graph data de-anonymization
social graph de-anonymization [NS09]
34
35. AOL Privacy Debacle
In August 2006, AOL released anonymized search query logs
657K users, 20M queries over 3 months (March-May)
Opposing goals
Analyze data for research purposes, provide better services for users and
advertisers
Protect privacy of AOL users
Government laws and regulations
Search queries may reveal income, evaluations, intentions to acquire
goods and services, etc.
AOL User 4417749
AOL query logs have the form
<AnonID, Query, QueryTime, ItemRank, ClickURL>
ClickURL is the truncated URL
NY Times re-identified AnonID 4417749
Sample queries: “numb fingers”, “60 single men”, “dog that urinates on everything”,
“landscapers in Lilburn, GA”, several people with the last name Arnold
Lilburn area has only 14 citizens with the last name Arnold
NYT contacts the 14 citizens, finds out AOL User 4417749 is 62-year-old Thelma
Arnold
35
36. Netflix Prize Dataset
Netflix: online movie rental service
In October 2006, released real movie ratings of 500,000 subscribers
10% of all Netflix users as of late 2005
Names removed
Information may be perturbed
Numerical ratings as well as dates
Average user rated over 200 movies
Task is to predict how a user will rate a movie
Beat Netflix's algorithm (called Cinematch) by 10%
You get 1 million dollars
Netflix Prize
Dataset properties
17,770 movies
480K people
100M ratings
3M unknowns
40,000+ teams
185 countries
$1M for 10% gain
36
37. How do you rate a movie?
Report global average
I predict you will rate this movie 3.6 (1-5 scale)
Algorithm is 15% worse than Cinematch
Report movie average (Movie effects)
Dark knight: 4.3, Wall-E: 4.2, The Love Guru: 2.8, I heart Huckabees: 3.2,
Napoleon Dynamite: 3.4
Algorithm is 10% worse than Cinematch
User effects
Find each user's average
Subtract average from each rating
Corrects for curmudgeons and Pollyannas
Movie + User effects is 5% worse than Cinematch
More sophisticated techniques use covariance matrix
37
38. Netflix Dataset: Attributes
The most popular movie is rated by almost half the users!
The least popular: 4 users
Most users rank movies outside the top 100/500/1000
Why is the Netflix database "private"?
No explicit identifiers, which provides some anonymity
Privacy question: what can the adversary learn by combining it with background knowledge?
38
39. Netflix's Take on Privacy
"Even if, for example, you knew all your own ratings and their dates you probably couldn't identify them reliably in the data because only a small sample was included (less than one-tenth of our complete dataset) and that data was subject to perturbation. Of course, since you know all your own ratings that really isn't a privacy problem is it?"
-- Netflix Prize FAQ
Background Knowledge (Aux. Info.)
Information available to the adversary outside of the normal data release process
e.g. noisy auxiliary information about the target, public databases
39
40. De-anonymization Objective
Fix some target record r in the original dataset
Goal: learn as much about r as possible
Subtler than "find r in the released database"
Background knowledge is noisy
Released records may be perturbed
Only a sample of records has been released
False matches
Narayanan & Shmatikov 2008
40
41. Using IMDb as Aux
Extremely noisy, some data missing
Most IMDb users are not in the Netflix dataset
Here is what we learn from the Netflix record of one IMDb user (not
in his IMDb profile)
De-anonymizing the Netflix Dataset
Average subscriber has 214 dated ratings
Two is enough to reduce to 8 candidate records
Four is enough to identify uniquely (on average)
Works even better with relatively rare ratings
"The Astro-Zombies" rather than "Star Wars"
Fat Tail effect helps here:
most people watch obscure movies
(really!)
41
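A toy sketch of the linkage idea behind the Narayanan–Shmatikov attack (illustrative only, not their actual scoring function; all names and thresholds here are my assumptions): score every released record against the adversary's noisy auxiliary ratings, weight rare movies more, and claim a match only when the best score clearly dominates the runner-up.

from math import log

def match_score(aux, record, popularity):
    # aux/record: movie -> (rating, date-bucket); rarer movies weigh more.
    score = 0.0
    for movie, (rating, date) in aux.items():
        if movie in record:
            r, d = record[movie]
            if abs(r - rating) <= 1 and abs(d - date) <= 14:      # tolerate noise
                score += 1.0 / log(1 + popularity[movie])
    return score

def deanonymize(aux, released, popularity, eccentricity=1.5):
    scores = {rid: match_score(aux, rec, popularity) for rid, rec in released.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Only claim a match if the best score clearly dominates the second best.
    if scores[best] >= eccentricity * max(scores[runner_up], 1e-9):
        return best
    return None

popularity = {"The Astro-Zombies": 40, "Star Wars": 200_000}
released = {
    "r1": {"The Astro-Zombies": (4, 100), "Star Wars": (5, 40)},
    "r2": {"Star Wars": (5, 41)},
}
aux = {"The Astro-Zombies": (4, 103)}           # a couple of rare ratings suffice
print(deanonymize(aux, released, popularity))   # -> 'r1'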
42. More linking attacks
[Figure: linking a user's Profile 1 in IMDb with their Profile 2 in an online AIDS-survivors community via a shared rating vector]
Anonymity vs. Privacy
Anonymity is insufficient for privacy
Anonymity is necessary for privacy
Anonymity is unachievable in practice
Re-identification attack → anonymity breach → privacy breach
Just ask Justice Scalia:
"It is silly to think that every single datum about my life is private"
42
43. Beyond recommendations…
Adaptive systems reveal information about users
Social Networks
Sensitivity
Online social network services
Email, instant messenger
Phone call graphs
Plain old real-life relationships
43
44. Social Networks: Data Release
Typical pipeline: select a subset of nodes, select a subset of edges, compute the induced subgraph, sanitize attributes, then publish
Attack model: the published graph is combined with large-scale background knowledge
44
45. Motivating Scenario: Overlapping Networks
Social networks A and B have overlapping memberships
Owner of A releases anonymized, sanitized graph
say, to enable targeted advertising
Can owner of B learn sensitive information from released graph
A'?
Re-identification: Two-stage Paradigm
Re-identifying target graph =
Mapping between Aux and target nodes
Seed identification:
Detailed knowledge about small number of nodes
Relatively precise
Link neighborhood constant
In my top-5 call and email list… my wife
Propagation: similar to infection model
Successively build mappings
Use other auxiliary information
I'm on facebook and flickr from 8pm-10pm
Intuition: no two random graphs are the same
Assuming enough nodes, of course
45
46. Seed Identification: Background Knowledge
How:
• Creating sybil nodes
• Bribing
• Phishing
• Hacked machines
• Stolen cellphones
What: list of neighbors, degree, number of common neighbors of two nodes
(in the example figure: degrees (4, 5), common neighbors (2))
Preliminary Results
Datasets: 27,000 common nodes, only 15% edge overlap
150 seeds
32% re-identified as measured by centrality
12% error rate
46
47. Solutions
Database Privacy
Individuals (Alice, Bob, …, you) → collection and "sanitization" → users (government, researchers, marketers, …)
The "census problem"
Two conflicting goals
Utility: users can extract "global" statistics
Privacy: individual information stays hidden
How can these be formalized?
47
48. Database Privacy
Individuals (Alice, Bob, …, you) → collection and "sanitization" → users (government, researchers, marketers, …)
Variations on this model studied in
Statistics
Data mining
Theoretical CS
Cryptography
Different traditions for what "privacy" means
How can we formalize "privacy"?
Different people mean different things
Pin it down mathematically?
Goal #1: Rigor
Prove clear theorems about privacy
Few exist in the literature
Make clear (and refutable) conjectures
Sleep better at night
Goal #2: Interesting science
(New) computational phenomena
Algorithmic problems
Statistical problems
96
48
49. Basic Setting
DB = (x1, x2, …, xn); a sanitizer San (using random coins) sits between the database and the users (government, researchers, marketers, …), who issue queries 1…T and receive answers 1…T
Database DB = table of n rows, each in domain D
D can be numbers, categories, tax forms, etc.
E.g.: Married?, Employed?, Over 18?, …
97
Examples of sanitization methods
Input perturbation
Change data before processing
E.g. Randomized response
flip each bit of table with probability p
Summary statistics
Means, variances
Marginal totals (# people with blue eyes and brown hair)
Regression coefficients
Output perturbation
Summary statistics with noise
Interactive versions of above:
Auditor decides which queries are OK, type of noise
98
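The "randomized response" input-perturbation example can be made concrete in a few lines (a minimal sketch, not the deck's code): each respondent flips their true bit with probability p, which gives every individual plausible deniability, while the analyst can still de-bias the aggregate.

import random

def randomize_bit(true_bit: int, p: float = 0.25) -> int:
    # Input perturbation: flip the bit with probability p before it is stored.
    return true_bit ^ 1 if random.random() < p else true_bit

def estimate_true_rate(observed_rate: float, p: float = 0.25) -> float:
    # observed = true*(1-p) + (1-true)*p  =>  invert to recover the true rate
    return (observed_rate - p) / (1 - 2 * p)

random.seed(0)
truth = [1] * 300 + [0] * 700                      # 30% have the sensitive attribute
noisy = [randomize_bit(b) for b in truth]
observed = sum(noisy) / len(noisy)
print(observed, estimate_true_rate(observed))      # observed ≈ 0.40, estimate ≈ 0.30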
49
50. Two Intuitions for Privacy
"If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place." [Dalenius]
Learning more about me should be hard
Privacy is "protection from being brought to the attention of others." [Gavison]
Safety is blending into a crowd
Problems with Classic Intuition
Popular interpretation: prior and posterior views about an individual shouldn't change "too much"
What if my (incorrect) prior is that every UTCS graduate student has three arms?
How much is "too much"?
Can't achieve cryptographically small levels of disclosure and keep the data useful
The adversarial user is supposed to learn unpredictable things about the database
Straw man: Learning the distribution
Assume x1, …, xn are drawn i.i.d. from an unknown distribution
Def'n: San is safe if it only reveals the distribution
Implied approach:
learn the distribution
release a description of the distribution
or re-sample points from the distribution
Problem: tautology trap
the estimate of the distribution depends on the data… why is it safe?
50
51. Blending into a Crowd
Intuition: I am safe in a group of k or more
k varies (3… 6… 100… 10,000?)
Many variations on the theme:
The adversary wants a predicate g such that 0 < #{ i | g(xi) = true } < k
g is called a breach of privacy
Why?
Fundamental:
R. Gavison: "protection from being brought to the attention of others"
A rare property helps me re-identify someone
Implicit: information about a large group is public
e.g. liver problems are more prevalent among diabetics
101
Blending into a Crowd
Intuition: I am safe in a group of k or more
k varies (3… 6… 100… 10,000?)
Many variations on the theme:
The adversary wants a predicate g such that 0 < #{ i | g(xi) = true } < k
g is called a breach of privacy
Two variants: frequency in the DB vs. frequency in the underlying population
Why?
Fundamental:
R. Gavison: "protection from being brought to the attention of others"
A rare property helps me re-identify someone
Implicit: information about a large group is public
e.g. liver problems are more prevalent among diabetics
How can we capture this?
• Syntactic definitions
• Bayesian adversary
• "Crypto-flavored" definitions
51
52. Blending into a Crowd
Intuition: I am safe in a group of k or more
Pros:
appealing intuition for privacy
seems fundamental
mathematically interesting
meaningful statements are possible!
Cons:
does it rule out learning facts about a particular individual?
all results seem to make strong assumptions on the adversary's prior distribution
is this necessary? (yes…)
Impossibility Result [Dwork]
Privacy: for some definition of "privacy breach" there exist a distribution on databases and adversaries A, A' such that Pr(A(San) = breach) − Pr(A'() = breach) is non-negligible
For a reasonable "breach", if San(DB) contains information about DB, then some adversary breaks this definition
Example
Vitaly knows that Josh Leners is 2 inches taller than the average Russian
DB allows computing the average height of a Russian
This DB breaks Josh's privacy according to this definition… even if his record is not in the database!
52
53. Differential Privacy (1)
[Setting: database DB = (x1, …, xn); adversary A issues queries 1…T to the sanitizer San (using random coins) and receives answers]
Example with Russians and Josh Leners
The adversary learns Josh's height even if he is not in the database
Intuition: "Whatever is learned would be learned regardless of whether or not Josh participates"
Dual: whatever is already known, the situation won't get worse
Indistinguishability
Run the same interaction on DB = (x1, x2, x3, …, xn) and on DB' = (x1, x2, y3, …, xn), which differ in one row, producing transcripts S and S'
Requirement: the distance between the distributions of S and S' is at most ε
53
54. Diff. Privacy in Output Perturbation
The user asks the database x = (x1, …, xn) "tell me f(x)" and receives f(x) + noise
Intuition: f(x) can be released accurately when f is insensitive to individual entries x1, …, xn
Global sensitivity GS_f = max over neighbouring x, x' of ||f(x) − f(x')||_1 (the Lipschitz constant of f)
Example: GS_average = 1/n for sets of bits
Theorem: f(x) + Lap(GS_f / ε) is ε-indistinguishable
Noise generated from the Laplace distribution (see the sketch below)
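A minimal sketch of the Laplace mechanism described above (assuming the standard formulation; the function and parameter names are mine): for an average over n bits the global sensitivity is 1/n, so adding Laplace noise of scale GS_f/ε gives ε-differential privacy.

import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-transform sampling from the Laplace(0, scale) distribution.
    u = random.uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_average(bits, epsilon: float = 0.1) -> float:
    # For an average over n bits the global sensitivity is GS_f = 1/n,
    # so Laplace noise with scale GS_f / epsilon gives epsilon-DP.
    n = len(bits)
    sensitivity = 1.0 / n
    return sum(bits) / n + laplace_noise(sensitivity / epsilon)

random.seed(1)
data = [1] * 400 + [0] * 600
print(private_average(data, epsilon=0.1))   # close to 0.4; noise shrinks as n grows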
Differential Privacy: Summary
K gives ε-differential privacy if for all values of DB and Me and all transcripts t:
Pr[ K(DB − Me) = t ] / Pr[ K(DB + Me) = t ] ≤ e^ε ≈ 1 + ε
54
55. Why does this help?
With relatively little noise:
Averages
Histograms
Matrix decompositions
Certain types of clustering
……
Preventing Attribute Disclosure
Various ways to capture
"no particular value should be revealed"
Differential criterion:
"Whatever is learned would be learned regardless of whether or not person i participates"
Satisfied by indistinguishability
Also implies protection from re-identification?
Two interpretations:
A given release won't make privacy worse
A rational respondent will answer if there is some gain
Can we preserve enough utility?
55
56. Thanks to
Vitaly Shmatikov, James Hamilton, Salman Salamatian
56