This document provides an introduction to big data concepts. It discusses how traditional architectures have limitations for solving problems involving large weekly increases in data in the petabyte range from diverse sources. Apache Hadoop is presented as a solution using a clustered architecture that is scalable, flexible and cost-efficient. Key aspects of Hadoop covered include its use of commodity hardware, storage of data across clusters of nodes, and benchmarks for sorting large datasets efficiently.
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data (Cloudera, Inc.)
This document discusses how Cloudera Enterprise Data Hub (EDH) can be used for advanced analytics. EDH allows users to perform diverse concurrent analytics on large datasets without moving the data. It includes tools for machine learning, graph analytics, search, and statistical analysis. EDH protects data through security features and system change tracking. The document argues that EDH is the only platform that can support all these analytics capabilities in a single, integrated system. It provides several examples of how advanced analytics on EDH have helped organizations, including government agencies, address important problems.
This document provides an overview of nonprofit technology essentials presented by NPower Michigan. It discusses computing trends like operating systems and mobile devices. Top nonprofit tech resources like TechSoup.org and Idealware.org are covered. The presentation reviews infrastructure needs such as computers, printers, internet, backup, and phone systems. It also discusses software including office suites, databases, and online collaboration tools. Overall the document aims to educate nonprofits on important technology topics and resources.
The document summarizes the key topics and concepts covered in a learning portfolio project, including word processing, spreadsheets, desktop publishing, PowerPoint, Prezi, ergonomics, fonts, encryption, hacking, VoIP, search engines, operating systems, cookies, hyperlinks, clip art, cursors, domain names, input/output devices, attachments, and firewalls. Definitions and brief descriptions are provided for each term in point form.
This document provides an introduction and overview of HDFS and MapReduce in Hadoop. It describes HDFS as a distributed file system that stores large datasets across commodity servers. It also explains that MapReduce is a framework for processing large datasets in parallel by distributing work across clusters. The document gives examples of how HDFS stores data in blocks across data nodes and how MapReduce utilizes mappers and reducers to analyze datasets.
The document discusses the benefits of using open-source software in libraries. It explains that open-source software is free to download and use, but may require technical expertise to install and maintain. While there are no initial licensing costs, libraries should consider ongoing costs of support, training, and potential mergers or acquisitions of open-source software companies. Popular open-source alternatives mentioned include operating systems like Ubuntu, photo editors like GIMP, and productivity suites like OpenOffice.
From archive to insight: debunking myths of analytics on object stores (Dean Hildebrand)
This document summarizes four common myths about using object stores for analytics and debunks each one. It discusses that data does not need to migrate between Swift and HDFS, object stores can support frameworks beyond just in-memory analytics, they can efficiently support frameworks that require appending to files like Hive and HBase, and object stores are not inherently slow for analytics when used with Swift-on-File. The document demonstrates Swift-on-File, which allows objects stored in Swift to be accessed as files, avoiding unnecessary data movement and enabling analytics in place directly on the object store.
Rich Dietz presented on getting started with the nonprofit cloud. He defined the cloud as computing resources delivered as an internet-accessible service. Some key advantages of the cloud include lower costs, availability from anywhere, scalability, and improved sharing and collaboration. Potential disadvantages include less control, reliance on an internet connection and cloud provider, and security concerns. Dietz recommended nonprofits start with the cloud by testing one area like email or backups, research vendors thoroughly, and ensure proper training.
UW School of Medicine Social Engineering and Phishing Awareness (Nicholas Davis)
An IT Security presentation I created for faculty and staff of the UW-Madison, School of Medicine, about how to recognize and defend against the threats of complex Phishing and Social Engineering, to protect sensitive digital information.
Phreaks were people interested in exploring and manipulating telephone systems in the 1960s-1980s. They discovered techniques like using a 2600Hz tone or "blue boxes" to place free long-distance calls. This subculture grew with information sharing on bulletin board systems and magazines like TAP. Over time, phreaking merged with computer hacking as phone networks became digitized and more connected to computers. The activities of early phreaks helped reveal vulnerabilities but also led to crackdowns and some engaging in illegal toll fraud.
Stuxnet is a sophisticated malware that targeted Siemens supervisory control and data acquisition (SCADA) systems. It used multiple zero-day exploits to spread via USB devices and network shares to infect SCADA systems indirectly connected to the internet. Stuxnet installed rootkits to hide its files and injected itself into processes to remain undetected while sabotaging its targets. It was the first malware known to target and damage physical infrastructure.
Cybersecurity investigation with Splunk (Ibrahimous)
A demonstration of investigating cyberattacks, in the context of a SOC, using Splunk.
Presentation given for ISSA France's Security Tuesday on May 19, 2015.
Social engineering: the art of influence and manipulation (Christophe Casalegno)
A chain is never stronger than its weakest link, and that weak link is you; it's all of us. Social engineering is a set of methods and techniques for gaining access to an information system or to confidential information through a relationship-based approach built on influence and manipulation. In practice, the attacker exploits human vulnerabilities and their knowledge of the target, its customers, suppliers, and subcontractors, using manipulation, deception, and influence across various communication media, from social networks to in-person encounters, via email and telephone.
Through this presentation, I invite you to discover the broad outlines of the conferences and awareness sessions I regularly give in this field. I hope it will help you better understand what social engineering is, the danger that those who master it can pose to your company or organization, and how to protect yourself from it.
The Stuxnet worm was designed to target Siemens industrial control systems used in Iran's uranium enrichment centrifuges. It spread to these systems through infected USB drives and exploited multiple Windows vulnerabilities. It then took control of centrifuges and varied their speeds, damaging around 1,000 centrifuges and slowing Iran's nuclear program. While not intended to spread beyond Iran, it ended up infecting systems in other countries as well through file transfers.
Hadoop Master Class: A concise overview (Abhishek Roy)
Abhishek Roy will teach a master class on Big Data and Hadoop. The class will cover what Big Data is, the history and background of Hadoop, how to set up and use Hadoop, and tools like HDFS, MapReduce, Pig, Hive, Mahout, Sqoop, Flume, Hue, Zookeeper and Impala. The class will also discuss real world use cases and the growing market for Big Data tools and skills.
MongoDB is a document database that provides a more flexible schema than relational databases. It allows embedding related data and easier updates than relational databases with object-relational mapping. MongoDB scales horizontally through sharding and provides high availability through replica sets. It supports different consistency models including eventual and strong consistency through write concerns and read preferences.
How Celtra Optimizes its Advertising Platform with Databricks (Grega Kespret)
Leading brands such as Pepsi and Macy’s use Celtra’s technology platform for brand advertising. To inform better product design and resolve issues faster, Celtra relies on Databricks to gather insights from large-scale, diverse, and complex raw event data. Learn how Celtra uses Databricks to simplify their Spark deployment, achieve faster project turnaround time, and empower people to make data-driven decisions.
In this webinar, you will learn how Databricks helps Celtra to:
- Utilize Apache Spark to power their production analytics pipeline.
- Build a “Just-in-Time” data warehouse to analyze diverse data sources such as Elastic Load Balancer access logs, raw tracking events, operational data, and reportable metrics.
- Go beyond simple counting and group events into sequences (i.e., sessionization) and perform more complex analysis such as funnel analytics.
The New Frontier: Optimizing Big Data Exploration (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and Cirro
Live Webcast on February 11, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=0ec1fa381886313cc06d841015c65898
As information ecosystems continue to expand, businesses are searching for ways to combine traditional analytics with a new source of insight: Big Data. But with data flooding in from all kinds of sources, fast access and performance at scale can easily become an issue. One effective approach for solving this challenge is data federation, a method that involves taking the analytical processing to the data, allowing streamlined access to multiple data sources without the expensive ETL overhead or building of semantic layers.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains how the prevalence of distributed data calls for a new approach to Big Data. He will be briefed by Mark Theissen of Cirro, who will tout his company’s Data Hub, a data federation solution that provides a single point of access to all enterprise data assets without excessive data movements, preprocessing or staging. He will discuss how data federation differs from virtualization and ETL approaches, and demonstrate how a Cirro deployment solves the analytics challenge of integrating data silos across the data center – and the cloud – using the BI tools you already have on your desktop for real-time distributed analytics.
Visit InsideAnalysis.com for more information.
Hadoop and the Relational Database: The Best of Both Worlds (Inside Analysis)
This document summarizes a presentation about the Splice Machine database product. Splice Machine is described as a SQL-on-Hadoop database that is ACID-compliant and can handle both OLTP and OLAP workloads. It provides typical relational database functionality like transactions and SQL on top of Apache Hadoop. Customers reportedly see a 10x improvement in price/performance compared to traditional databases. The presentation provides details on Splice Machine's architecture, performance benchmarks, customer use cases, and support for analytics and business intelligence tools.
The Rise of Digital Audio (AdsWizz, DevTalks Bucharest, 2015) (Bogdan Bocse)
The exponential growth of digital audio brings AdsWizz challenges that relate not only to huge volumes of data, but also to meeting millisecond constraints on response times and to leveraging rich prediction models. Let us share how big data stores, distributed processing, and elastic infrastructures have turned from being the cool trend to being business-as-usual for us.
Choosing technologies for a big data solution in the cloud (James Serra)
Has your company been building data warehouses for years using SQL Server? And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support "Big Data"? What technologies and tools should you use? That is what this presentation will help you answer. First we will cover what questions to ask concerning data (type, size, frequency), reporting, performance needs, on-prem vs cloud, staff technology skills, OSS requirements, cost, and MDM needs. Then we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data? Should I use a data lake? Do I still need a cube? What about Hadoop/NoSQL? Do I need the power of MPP? Should I build a "logical data warehouse"? What is this lambda architecture? Can I use Hadoop for my DW? Finally, we'll show some architectures of real-world customer big data solutions. Come to this session to get started down the path to making the proper technology choices in moving to the cloud.
The Common BI/Big Data Challenges and Solutions presented by seasoned experts, Andriy Zabavskyy (BI Architect) and Serhiy Haziyev (Director of Software Architecture).
This was a complimentary workshop where attendees had the opportunity to learn, network and share knowledge during the lunch and education session.
Bridging the Gap: Analyzing Data in and Below the Cloud (Inside Analysis)
The Briefing Room with Dean Abbott and Tableau Software
Live Webcast July 23, 2013
http://www.insideanalysis.com
Today’s desire for analytics extends well beyond the traditional domain of Business Intelligence. That’s partly because business users are realizing the value of mixing and matching all kinds of data, from all kinds of sources. One emerging market driver is Cloud-based data, and the desire companies have to analyze this data cohesively with their on-premise data sets.
Register for this episode of The Briefing Room to learn from Analyst Dean Abbott, who will explain how the ability to access data in the cloud can play a critical role for generating business value from analytics. He’ll be briefed by Ellie Fields of Tableau Software who will tout Tableau’s latest release, which includes native connectors to cloud-based applications like Salesforce.com, Amazon Redshift, Google Analytics and BigQuery. She’ll also demonstrate how Tableau can combine cloud data with other data sources, including spreadsheets, databases, cubes and even Big Data.
Dapper: the microORM that will change your life (Davide Mauri)
ORM or Stored Procedures? Code First or Database First? Ad-Hoc Queries? Impedance Mismatch? If you're a developer, or you are a DBA working with developers, you have heard all these terms at least once in your life… and usually in the middle of a strong discussion, debating about one or the other. Well, thanks to StackOverflow's Dapper, all these fights are finished. Dapper is a blazing fast microORM that allows developers to map SQL queries to classes automatically, leaving (and encouraging) the usage of stored procedures, parameterized statements and all the good stuff that SQL Server offers (JSON and TVP are supported too!) In this session I'll show how to use Dapper in your projects, from the very basics to some more complex usages that will help you to create *really fast* applications without the burden of huge and complex ORMs. The days of Impedance Mismatch are finally over!
Denny Lee introduced Azure DocumentDB, a fully managed NoSQL database service. DocumentDB provides elastic scaling of throughput and storage, global distribution with low latency reads and writes, and supports querying JSON documents with SQL and JavaScript. Common scenarios that benefit from DocumentDB include storing product catalogs, user profiles, sensor telemetry, and social graphs due to its ability to handle hierarchical and de-normalized data at massive scale.
Big Data made easy in the era of the Cloud (Demi Ben-Ari)
Talking about the ease of use and handling Big Data technologies in the Cloud. Using Google Cloud Platform and Amazon Web Services and all of the tools around it.
Showing the problems and how we can solve them with simple tools.
Horses for Courses: Database Roundtable (Eric Kavanagh)
The blessing and curse of today's database market? So many choices! While relational databases still dominate the day-to-day business, a host of alternatives has evolved around very specific use cases: graph, document, NoSQL, hybrid (HTAP), column store, the list goes on. And the database tools market is teeming with activity as well. Register for this special Research Webcast to hear Dr. Robin Bloor share his early findings about the evolving database market. He'll be joined by Steve Sarsfield of HPE Vertica, and Robert Reeves of Datical in a roundtable discussion with Bloor Group CEO Eric Kavanagh. Send any questions to info@insideanalysis.com, or tweet with #DBSurvival.
Business in the Driver’s Seat – An Improved Model for Integration (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on September 30, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=bfff40f7c9645fc398770ea11152b148
The fueling of information systems will always require some effort, but a confluence of innovations is fundamentally changing how quickly and accurately it can be done. Gone are long cycle times for development. Today, organizations can embrace a more rapid and collaborative approach for building analytical applications and data warehouses. The key is to have business experts working hand-in-hand with data professionals as the solutions take shape, thus expediting the speed to valuable insights.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains the changing nature of information design. He’ll be briefed by WhereScape President Mark Budzinski, who will discuss his company’s data warehouse automation solutions and how they enable collaborative development. He will share use cases that illustrate how, by aligning business and IT, organizations can enable faster and more agile data warehouse development.
Visit InsideAnalysis.com for more information.
Turn Data Into Actionable Insights - StampedeCon 2016 (StampedeCon)
At Monsanto, emerging technologies such as IoT, advanced imaging and geo-spatial platforms; molecular breeding, ancestry and genomics data sets have made us rethink how we approach developing, deploying, scaling and distributing our software to accelerate predictive and prescriptive decisions. We created a Cloud based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and integrate analytics with our core product platforms.
As part of this talk, we will share our journey of transformation, showing how we enabled a collaborative discovery analytics environment for data science teams to perform model development; provisioned data through APIs and streams; and deployed models to production through our auto-scaling big-data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical, and batch analytics at scale, integrating analytics with our core product platforms to turn data into actionable insights.
Relational databases vs Non-relational databases (James Serra)
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
A Young Lady's Illustrated Primer to Architecture and Technical Decision-Maki... (DevOpsDays Tel Aviv)
This document provides tips for making better technical decisions in 3 parts:
1. It discusses accelerating trends in technology including polyglot storage, microservices, and coupling platforms. This creates a paradox of choice and complexity for technical decisions.
2. It outlines principles for technical decisions including ensuring technology serves the mission, resisting software sprawl, optimizing globally not locally, choosing boring proven technologies, and understanding risk appetite.
3. It emphasizes that as companies mature, operational impact must drive decisions more, and to spend risk on key differentiators and celebrate those who remove code as much as those who add features.
Jeremy Engle's slides from the Redshift / Big Data meetup on July 13, 2017 (AWS Chicago)
"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
Similar to Hofstra University - Overview of Big Data (20)
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of May 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
Build applications with generative AI on Google Cloud (Márton Kodok)
We will explore Vertex AI Model Garden-powered experiences and learn more about the integration of these generative AI APIs. We will see in action what the Gemini family of generative models offers developers for building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, which come in different versions. We will cover how to use the API to: execute prompts in text and chat; cover multimodal use cases with image prompts; fine-tune and distill to improve knowledge domains; and run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using current generative AI industry trends.
Open Source Contributions to Postgres: The Basics, POSETTE 2024 (ElizabethGarrettChri)
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
Generative Classifiers: Classifying with Bayesian decision theory, Bayes’ rule, Naïve Bayes classifier.
Discriminative Classifiers: Logistic Regression, Decision Trees: Training and Visualizing a Decision Tree, Making Predictions, Estimating Class Probabilities, The CART Training Algorithm, Attribute selection measures- Gini impurity; Entropy, Regularization Hyperparameters, Regression Trees, Linear Support vector machines.
Enhanced data collection methods can help uncover the true extent of child abuse and neglect. This includes Integrated Data Systems from various sources (e.g., schools, healthcare providers, social services) to identify patterns and potential cases of abuse and neglect.
1. Big Data: Technologies & Challenges Facing Business Today
A Practical Guide to Getting Started
2. Hello, My Name is Sara Robertson
About Me
• I’m the VP of Technology at CPX Interactive.
• Previously ran the platform team at Warner Music Group.
• I choose technology based on cost, efficiency, availability of talent, valuation potential, long-term outlook, community support, and other reasons besides just pure tech.
• I love agile, open source, bio-hacking, anime and martial arts movies, emo, audiobooks.
• Favorite tech skills: optimization, debugging.
About Me + Data
• Started at a mainframe company doing Nagios-style server & network monitoring.
• Spent my early years obsessed with Oracle, then Postgres, then MySQL.
• Most of my data experience is in either high-traffic web applications or back-office data warehousing.
• I believe that data is about humans as much as it is about technology, and if your solution doesn’t speak to your users then it’s not really a solution.
5. Four Problems With Data Problems
Too Big, Too Fast, Too Disparate, Too Unwieldy
[Slide graphics illustrating "too unwieldy": two wide ad-data schemas, a creative/campaign lookup feed (system, creative_id, cpx_creative_name, viq_creative_name, ... viq_placement_id, x_cslookup) and an exchange auction feed from hourlycpxDataSiphonHour.py (auction_id_64, date_time, user_tz_offset, ... clear_fees, advertiser_exchange_rate, x_datasiphon), each with 80-odd columns typed bigint, string, tinyint, smallint, int, and float; a chart contrasting Google’s qpms in 2000 vs. Google’s qpms in 2011; and panels contrasting old vs. new table layouts, old vs. new file formats, and the old waterfall pattern of change in data & business processes vs. the new, agile pattern.]
6. Data Problems in Advertising Examples
Our statistics at CPX…
• 5.5+ billion impressions per day
• 20+ billion bids per day
• 45+ billion segments per day
• 80+ columns per data stream
• Columns average 25+ bytes
That’s more than 100 Terabytes and 75 Billion records every day! * (A back-of-envelope check of these figures follows below.)
[Slide diagram, "Life of a Single Ad Impression": a Website and Ad Server (access logs, cookies, interactions…) feed Exchanges (predictions, demographics, prices…; potential buyers, market trends, preferences…) and Bidders (bid parameters, wins/losses, ceilings/floors…; creatives, targeting, analytics…), producing bid attempts, imp value, market demand, revenue, costs, performance… and finally profit, winners, demographics.]
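For the curious, here is a back-of-envelope check of the headline claim in Python, assuming (per the bullets above) that the three streams sum to roughly 70.5 billion records per day, each with 80+ columns averaging 25+ bytes:

```python
# Rough sanity check of the slide's claim, using the bullets above.
records_per_day = (5.5 + 20 + 45) * 1e9      # impressions + bids + segments
bytes_per_record = 80 * 25                   # 80+ columns * 25+ bytes each
total_tb = records_per_day * bytes_per_record / 1e12
print(f"{records_per_day / 1e9:.1f}B records/day, ~{total_tb:.0f} TB/day")
# -> 70.5B records/day, ~141 TB/day: comfortably "more than 100 TB"
```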
7. Data Problems in Music Examples
Our statistics at Warner…
• 5+ million radio plays daily
• 10+ million digital tracks sold daily
• $1+ million in ecommerce daily
• 20+ million fans online
• 10-20 channels of interaction with every fan
• Thousands of feeds of data that could potentially mention a band
An essentially unlimited supply of new data streams with ever-changing data formats!
[Slide example: one moment seen across channels.
1:07:00pm, on TV in New York: Michael Buble´ appears on Oprah.
1:07:01pm, radio plays in Seattle: "MBUBLE HAVENT MET YOU YE"; "BUBLE´, MICHAEL HAVEN’TMETYOUYET"
1:07:02pm, on Twitter in cyberspace: "OMG this song is so sick! <3 #mbuble #haventmet"; "This met you yet Bubble´ song makes me sick."
1:07:03pm, on website from China: 12 visitors to the website.]
• First match up the many different versions of the artist’s name (a sketch of this step follows below)
• Then analyze sentiment to tell the difference between uses of "sick"
• Then augment sparse data streams with useful dimensions (time, location)
• Then decide how to correlate data!
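A minimal sketch of that first step, normalizing artist-name variants to a canonical key; the catalog entries here are illustrative, and a real system would load them from a master artist list:

```python
import re

# Hypothetical canonical catalog mapping an artist to known name variants.
CATALOG = {
    "Michael Buble": ["MBUBLE", "BUBLE, MICHAEL", "BUBLE´, MICHAEL", "#mbuble"],
}

def normalize(raw: str) -> str:
    """Uppercase, strip accents/punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", "", raw.upper())
    return re.sub(r" +", " ", cleaned).strip()

# Invert the catalog into a lookup from normalized variant -> canonical name.
VARIANT_INDEX = {
    normalize(v): canonical
    for canonical, variants in CATALOG.items()
    for v in variants + [canonical]
}

def match_artist(mention: str) -> str | None:
    return VARIANT_INDEX.get(normalize(mention))

print(match_artist("BUBLE´, MICHAEL"))   # -> Michael Buble
print(match_artist("#mbuble"))           # -> Michael Buble
```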
8. Database Types
[Slide diagram, arranged from unstructured to structured:
• Document Store and Distributed File System: used for unstructured, fast-flowing data
• Key-Value Pair: used for semi-structured, fast-flowing data
• Massively Parallel (MPPs): used for structured, high-volume, high read + write data
• Master/Slave: used for heavy-read apps with normal data volume & scale requirements
• In-Memory Columnar: used for super-high-speed, read-only access to cacheable data
Document stores export into Parallel and Master databases, which cache into Columnar databases.]
9. [Image-only slide; no text extracted.]
10. What we tried in Music Case Study
Roll-Your-Own approach
• Python + RabbitMQ + MongoDB + PHP for custom BI layer
• Custom development of workflow, transformation, storage, correlation, smoothing, analysis
• Custom dev of dashboards, reports, charts, etc. for the business
Why it didn’t work
• Bleeding-edge technologies were too immature and cost of talent was too high
• Outsourced dev + insourced support = fail
• Too much overhead to get a usable product
11. What we’re doing in Advertising Case Study
Use-A-Stack approach
• Leverage a kick-start with a stack that reduces implementation time, learning curve, and talent costs
• Write pluggable modules
• Build the plan for multi-layered data storage from the beginning
Why it’s working
• By keeping our investment and footprint light, we’re able to respond quickly to changes in the industry & technology ecosystem
• The multiple layers of data are the key to building products at scale
12. What does our stack look like? (We only build the red stuff!)
[Slide diagram of six stack layers, each mixing custom pieces (custom modules, themes, cores, code, scripts, SQL, glue code) with contrib pieces (contrib modules, themes, core, libraries, services, tools) and paid services:
• web platform: Drupal, Wordpress, PHP, Javascript, jQuery, HTML, CSS, Flash, Bash, Perl, etc.
• data warehouse: MySQL, PostgreSQL, Hadoop, Cloudera, Hive, Hue, Impala, Python, Java, SQL, etc.
• hosting: Amazon, Ubuntu, Apache2, Nginx, Node.js, Memcache, Highwinds CDN, etc.
• 3rd-party integrations: Appnexus, Right Media, Google, Salesforce, Zendesk, Microsoft, Chrome, Dropbox, etc.
• development: Git, VSphere, VMware, Drush Make, MAMP, Confluence, Agile/Scrum, SOASTA, etc.
• products / R&D: Mobile, Video, Bidders, IPs, Viewability, Emerging Tech...]
13. Hadoop Distributed File System
HDFS: Everybody’s Doing It
– It’s just a file system!
– Feed it gzips, csvs, whatever you’ve got
– Command line + library interface to read/write files to it
– Can be slow due to replication across network to data nodes
– Not much different than sed/awk
[Slide diagram: one Name Node coordinating six Data Nodes. A command-line sketch follows below.]
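To make the "command line + library interface" bullet concrete, here is a minimal sketch driving the standard `hadoop fs` commands from Python; the file and directory paths are illustrative, and it assumes a running cluster with the `hadoop` CLI on PATH:

```python
import subprocess

local_file = "events.csv.gz"            # illustrative local file
hdfs_dir = "/data/raw/2013-06-01/"      # illustrative HDFS directory

# It really is "just a file system": mkdir, put, ls, read back out.
subprocess.run(["hadoop", "fs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hadoop", "fs", "-put", local_file, hdfs_dir], check=True)
subprocess.run(["hadoop", "fs", "-ls", hdfs_dir], check=True)

# -text decompresses gzip on the way out, so you can pipe straight to sed/awk.
subprocess.run(["hadoop", "fs", "-text", hdfs_dir + local_file], check=True)
```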
14. Parallel DBs RDBMS
The Holy Grail of Database Scalability
Claims of database parallelization in the past have been greatly exaggerated. Nevertheless I believe we might be dawning on a new era in this space.
Paths to parallelization
• Sharding: Manual split of tables into independent db instances. Joins across dbs not possible without manual extract and re-load into one instance. (A minimal sketch follows after the table below.)
• Federation: Automatic split of tables into independent db instances. Joins across dbs managed by high-level software layer that extracts data and joins/merges outside the db instances. Performance penalty in data extraction/merge. Much redundant work performed by each db instance in parsing and compiling SQL.
• True-MPP: Only one db instance, with multiple compute/storage nodes. All joins across nodes are managed natively by the execution engine within the db instance. No redundant work performed, no performance penalties.

Technique  | Degree of Automation | Vendor     | Price
Sharding   | Manual               | Various…   | Low
Federation | Semi-Manual          | GreenPlum  | High
True-MPP   | Fully Auto           | Netezza    | High
True-MPP   | Fully Auto           | XtremeData | Low

* Thanks to Ravi, the CTO of XtremeData, for contributing this break-down!
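Here is a minimal sketch of the manual-sharding path, assuming hypothetical per-shard MySQL DSNs; a stable hash of the key picks the instance:

```python
import hashlib

# Hypothetical shard map: four independent db instances (DSNs are made up).
SHARDS = [
    "mysql://db0.internal/app",
    "mysql://db1.internal/app",
    "mysql://db2.internal/app",
    "mysql://db3.internal/app",
]

def shard_for(user_id: str) -> str:
    """Manual sharding: a stable hash routes each key to one instance.
    Note the limitation the slide calls out: a join across shards is
    impossible without extracting and re-loading into one instance."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user-42"))   # the same key always lands on the same shard
```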
15. Traditional DBs RDBMS
MySQL Is Your Best Friend
• Feeding it from a warehouse is the hardest part; needs workflow software and reduce jobs
• Works great for read-heavy web applications
• Cheap talent, cheap hosting, tons of support
• Creativity required for heavy writes, i.e. node.js + queuing mechanism (a sketch of the queuing idea follows below)
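A minimal Python sketch of that queuing idea: accept writes instantly into an in-memory queue and flush to MySQL in batches from a single worker. Here `db_execute` is a stand-in for a real MySQL client call:

```python
import queue
import threading
import time

writes: queue.Queue = queue.Queue()

def db_execute(batch):
    # Stand-in for one multi-row INSERT via a real MySQL client.
    print(f"flushing {len(batch)} rows in one multi-row INSERT")

def flusher(batch_size=500, interval=0.5):
    # Single worker drains the queue in batches, decoupling app writes
    # from database round-trips.
    while True:
        time.sleep(interval)
        batch = []
        while not writes.empty() and len(batch) < batch_size:
            batch.append(writes.get())
        if batch:
            db_execute(batch)

threading.Thread(target=flusher, daemon=True).start()
writes.put(("impression", "user-42", 1))   # returns immediately, no db round-trip
time.sleep(1)                              # give the demo worker time to flush
```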
16. In-Memory Columnar DBs In-Memory
The New Memcached
• In-memory DBs or "Columnar" databases are just key-value pairs: put(‘name’, ‘value’)
• Some sophisticated layers have been built on top to turn it into near-SQL
• Crazy fast solution for read-heavy systems like analytics
• Still needs workflow, management, and a traditional backend storage system (a toy sketch follows below)
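A toy illustration of the slide's point that underneath it is just put(key, value): a batch job precomputes columns from the warehouse, and reads never leave RAM. The keys and figures here are invented for the demo:

```python
store: dict[str, list[float]] = {}

def put(key: str, values: list[float]) -> None:
    store[key] = values                  # precomputed column, loaded by the workflow layer

def get_sum(key: str) -> float:
    return sum(store[key])               # read-only, served entirely from memory

put("impressions:2013-06-01:by_hour", [1.2e8, 9.7e7, 8.9e7])
print(get_sum("impressions:2013-06-01:by_hour"))   # fast analytics read
```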
17. Big Data Distribution Requirements Hosting
Massive Cheap Infrastructure
• Crazy virtual server farms! 1000+ servers get created and destroyed to perform 1 job
• Automation and deployment of these servers is crucial; infrastructure automation is the new hot skill
• Small-to-medium systems or growing products use cloud first and only invest in metal once stabilized, and even then it’s rarely cost effective
• Connections between the servers drive the performance of your data warehouse solution!
Life in the Cloud: It’s so different. Forget everything you thought you knew. Except Unix.
Major Bottlenecks:
• Reading & writing to disk: disks are usually network-connected to cloud-based servers
• Communicating with other servers during replication; need to shave off milliseconds with optimizations
• Staying ahead of storage space limitations with archive jobs
• Partitioning large datasets based on primary reduce filters (a sketch follows below)
• Keeping up with your dataset when you start to get behind
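A minimal sketch of partitioning on the "primary reduce filter": if every job filters by day and hour first, lay the files out that way so a job reads only its slice. The paths and event shape are illustrative:

```python
import os

def partition_path(root: str, event: dict) -> str:
    # Directory-per-partition layout keyed on the fields jobs filter by first.
    return os.path.join(root, f"date={event['date']}", f"hour={event['hour']:02d}")

event = {"date": "2013-06-01", "hour": 13, "payload": "..."}
print(partition_path("/data/impressions", event))
# -> /data/impressions/date=2013-06-01/hour=13
# A job for 1pm on June 1 now touches one directory instead of the full set.
```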
18. Coordination Strategies Implementation
• Amazon Hosted Cloud: It’s like a treasure hunt.*
• Vsphere/Openstack Private Cloud: Roll your own someday.
• Cloudera Hadoop Management: You will love life.
• Chef Deployment Automation: OMG life gets better.
* Note: Rackspace is also awesome.
19. Workflow Strategies Implementation
You Still Need Data In & Out
• Hive/Pig – You definitely need them
– SQL sits on top of Hadoop so you can query flat files like a table! (A small Hive sketch follows below.)
– Output into an RDBMS is easy, but managing the jobs is hard
– Nobody wants to learn Map Reduce
• Custom Coding
– Long-term supportability is low
– High cost & slow to market
• Cloudera has Workflow Services!
– Impala
– Flume
No one is really the standard in this space yet, although there are a lot of really interesting players. Check out the Big Data chart for more!
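A minimal sketch of the "query flat files like a table" idea, running HiveQL from Python. The table name, columns, and HDFS path are illustrative, and it assumes the `hive` CLI is installed on the gateway machine:

```python
import subprocess

# HiveQL mapping a directory of CSV flat files to a queryable "table".
hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS impressions (
  event_time  STRING,
  creative_id BIGINT,
  cost        FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/impressions/';

SELECT creative_id, COUNT(*) AS imps
FROM impressions
GROUP BY creative_id;
"""

# Hive compiles the SQL into MapReduce jobs; nobody writes the MR by hand.
subprocess.run(["hive", "-e", hiveql], check=True)
```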
20. The Big Data Adoption Problem Adoption
Problems:
• People don’t know what to do with the data or how to gain insights
• The data changes too fast for traditional software development; people don’t know what they want until they see it, and they can’t see it until they tell you what they want!
• If the business can’t feel the benefits of the infrastructure, it can’t continue to invest in big data
Solutions:
• Open up windows into the workflow so humans can dig around and discover things in the data; teach everyone SQL
• Provide simple BI and visualization solutions that don’t require custom development
• Support the classical Excel part of the business world, and make your data accessible in tabular exports (a sketch follows below)
• Continue development on custom reporting platforms, learning from the first three steps along the way
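A minimal sketch of the tabular-exports solution: run a SQL query and hand the business an Excel-friendly CSV. Here sqlite3 stands in for the real warehouse connection, and the table and figures are invented for the demo:

```python
import csv
import sqlite3

# In-memory stand-in for the warehouse, seeded with demo rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (campaign TEXT, spend REAL)")
conn.executemany("INSERT INTO daily VALUES (?, ?)",
                 [("summer_sale", 1200.0), ("summer_sale", 800.0), ("brand", 300.0)])

rows = conn.execute("SELECT campaign, SUM(spend) FROM daily GROUP BY campaign")
with open("spend_by_campaign.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["campaign", "total_spend"])   # header row for Excel users
    writer.writerows(rows)
```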
21. Fancy Stuff Adoption
If you’ve come this far you can finally have…
• Statistical modeling
• Sentiment analysis
• Prediction algorithms
• Machine learning
• Mmmmmmm fun stuff…
BUT NOT UNTIL YOU CAN SUPPORT IT!!!!
22. The Most Important Things to Know Cheatsheet
• It’s still all about the reads vs. writes
• HDFS is just a file system format for documents
• Hadoop is just for crunching and outputting into normal databases; you don’t actually point an application at it
• MPPs are awesome and the wave of the future
• In-memory columnar databases are all the rage (because they’re crazy fast) and will probably be a requirement for all high-scale apps in the future
• Don’t forget to become awesome at Unix system & network administration, because all the same commands work in the cloud and it’s the only way to understand what’s going on underneath the hood!
23. What to do right now Try it Out
• Download Openstack and install it on your laptop, OR register for Amazon AWS
• In your new cloud:
– Download & install Cloudera Community
– Spin up a few servers & add them to Cloudera
– Find the open source XtremeData MPP in the Marketplace
• Get more advanced:
– Set up a Chef implementation; try automating a few server spin-ups & spin-downs
– Try the open source Druid in-memory DB
– Set up a node.js server w/ Express and pipe in some real-time data
– Write a real-time data analytics front-end to see if it works!
• Where to get help?
– Forums are your best friend!
– IRC is your worst enemy, but it’s still there for you!
– Wikipedia, Youtube, etc. all have great resources to learn.
Editor's Notes
Would Michael Buble´ stepping onto the stage before an East Coast audience inspire 12 visitors to the website from China 3 seconds later???
It’s the same as the decision to use a web framework instead of writing a new session handler over and over… who wants to write workflow & job management solutions from scratch?