Use cases and examples using Apache Spark, presented at the Hadoop User Group (UK) November 2014 Hadoop Meetup
http://www.meetup.com/hadoop-users-group-uk/events/217791892/
Apache Big Data Conference 2016, Vancouver BC: Talk by Andreas Zitzelsberger (@andreasz82, Principal Software Architect at QAware)
Abstract: On large-scale web sites, users leave thousands of traces every second. Businesses need to process and interpret these traces in real time to be able to react to the behavior of their users. In this talk, Andreas shows a real-world example of the power of a modern open-source stack. He will walk you through the design of a real-time clickstream analysis PaaS solution based on Apache Spark, Kafka, Parquet and HDFS. Andreas explains our decision making and presents our lessons learned.
Security is one of the fundamental features for enterprise adoption. Specifically, for SQL users, row/column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups in different ways, it is difficult to guarantee data governance in a consistent way. In this talk, we focus on SQL users and discuss how to provide row/column-level access controls with common access control rules throughout the whole cluster across various SQL engines, e.g., Apache Spark 2.1, Apache Spark 1.6 and Apache Hive 2.1. If some of the rules are changed, all engines are updated consistently in near real time. Technically, we enable Spark Thrift Server to work with the identity given by the JDBC connection and take advantage of the Hive LLAP daemon as a shared and secured processing engine. We demonstrate row-level filtering, column-level filtering and various column maskings in Apache Spark with Apache Ranger, which we use as a single point of security control.
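The effect of the row-level filters and column masks demonstrated in the talk can be illustrated with a minimal sketch. This is plain Python with hypothetical policy rules, user groups and field names; it only mimics the kind of policy Apache Ranger enforces, not Ranger itself:

```python
# Toy illustration of row-level filtering and column masking.
# The groups ("analyst_emea", "admin") and fields are hypothetical.
def apply_policies(rows, user_group):
    # Row-level filter: EMEA analysts may only see EMEA rows.
    if user_group == "analyst_emea":
        rows = [r for r in rows if r["region"] == "EMEA"]
    # Column mask: non-admins see only the last 4 digits of the card number.
    masked = []
    for r in rows:
        r = dict(r)
        if user_group != "admin":
            r["card_number"] = "****" + r["card_number"][-4:]
        masked.append(r)
    return masked

data = [
    {"region": "EMEA", "card_number": "4111111111111111"},
    {"region": "APAC", "card_number": "5500005555555559"},
]
print(apply_policies(data, "analyst_emea"))
```

In the setup described above, rules like these live centrally in Ranger and are pushed to every engine, which is what keeps Spark 1.6, Spark 2.1 and Hive 2.1 consistent with each other.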
Need to start querying data instantly? Amazon Athena is an interactive query service that makes it easy to run interactive queries on data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing your data immediately.
In this presentation, we will show you how easy Amazon Athena makes it to query your data stored in S3.
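Because Athena speaks standard SQL, the query itself looks like ordinary ANSI SQL over your S3 data. As a purely local illustration (sqlite3 standing in for Athena, with a hypothetical access-log table):

```python
import sqlite3

# Local stand-in for an Athena query. The table and columns are hypothetical;
# in Athena the same SELECT would run directly against data in S3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_logs (status INTEGER, bytes INTEGER)")
conn.executemany("INSERT INTO access_logs VALUES (?, ?)",
                 [(200, 1024), (404, 512), (200, 2048)])

# Standard SQL, just as you would submit it to Athena.
query = ("SELECT status, COUNT(*) AS hits, SUM(bytes) AS total_bytes "
         "FROM access_logs GROUP BY status ORDER BY status")
rows = conn.execute(query).fetchall()
print(rows)  # [(200, 2, 3072), (404, 1, 512)]
```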
AWS Glue is a fully managed, serverless extract, transform, and load (ETL) service that makes it easy to move data between data stores. AWS Glue simplifies and automates the difficult and time-consuming tasks of data discovery, conversion mapping, and job scheduling so you can focus more of your time on querying and analyzing your data using Amazon Redshift Spectrum and Amazon Athena. In this session, we introduce AWS Glue, provide an overview of its components, and share how you can use AWS Glue to automate discovering your data, cataloging it, and preparing it for analysis.
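One of the Glue tasks mentioned above, discovering a schema from raw records, can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Glue crawler's actual algorithm, and the field names are made up:

```python
# Naive schema inference in the spirit of a Glue crawler: sample records
# and infer one type per column, widening to string on conflicts.
def infer_schema(records):
    schema = {}
    for record in records:
        for key, value in record.items():
            inferred = ("bigint" if isinstance(value, int)
                        else "double" if isinstance(value, float)
                        else "string")
            # If two records disagree on a column's type, fall back to string.
            if schema.get(key, inferred) != inferred:
                schema[key] = "string"
            else:
                schema[key] = inferred
    return schema

records = [
    {"user_id": 42, "amount": 9.99, "country": "DE"},
    {"user_id": 43, "amount": 12.5, "country": "FR"},
]
print(infer_schema(records))
```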
Empower Your Security Practitioners with Elastic SIEM (Elasticsearch)
Learn how Elastic SIEM’s latest capabilities enable interactive exploration and automated analysis — all at the speed and scale your security practitioners need to defend your organization.
See the video: https://www.elastic.co/elasticon/tour/2019/washington-dc/empower-your-security-practitioners-with-elastic-siem
Video: https://www.youtube.com/watch?v=v69kyU5XMFI
A talk I gave at the Philly Security Shell meetup 2019-02-21 on how the Elastic Stack works and how you can use it for indexing and searching security logs. Tools I mentioned: Github repo with script and demo data - https://github.com/SecHubb/SecShell_Demo Cerebro - https://github.com/lmenezes/cerebro Elastalert - https://github.com/Yelp/elastalert For info on my SANS teaching schedule visit: https://www.sans.org/instructors/john... Twitter: https://twitter.com/SecHubb
Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch five years ago, AWS customers have launched more than 5.5 million Hadoop clusters.
In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
Speakers:
Ian Meyers, AWS Solutions Architect
Ian McDonald, IT Director, SwiftKey
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent (Amazon Web Services)
Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch four years ago, our customers have launched more than 5.5 million Hadoop clusters. In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
Amazon Elastic File System (Amazon EFS) is a file storage service for Amazon Elastic Compute Cloud (Amazon EC2) instances. Amazon EFS is easy to use and provides a simple interface that allows you to create and configure file systems quickly and easily. With Amazon EFS, storage capacity is elastic, growing and shrinking automatically as you add and remove files, so your applications have the storage they need, when they need it.
Spark started at Facebook as an experiment when the project was still in its early phases. Spark's appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time the system was used by a handful of people to process small amounts of data. However, we've come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook, in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned and scaled Apache Spark at Facebook to run on tens of thousands of machines, processing hundreds of petabytes of data, used by thousands of data scientists, engineers and product analysts every day. In this talk, we'll focus on three areas:
* Scaling Compute: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared-storage) clusters.
* Optimizing Core Engine: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second.
* Scaling Users: How we make Spark easy to use, and faster to debug, to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We will also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features.
SIEM (Security Information and Event Management) (Osama Ellahi)
In this presentation we cover basic knowledge about SIEM:
- What SIEM is
- How it works
- The SIEM process
- SIEM capabilities
- Some screenshots of Varonis, a tool used for collecting logs (log aggregation) and then applying machine learning algorithms to determine whether those logs indicate risk.
There are many other vendors who also provide tools for security information and event management; for example, QRadar by IBM is one of the best-known tools.
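The "risky or not" decision over aggregated logs mentioned above can be sketched with a trivial scoring rule. This is plain Python; the fields, thresholds and weights are entirely hypothetical, and real SIEM products use far richer models:

```python
# Toy risk scoring over an aggregated log event, illustrating the kind of
# decision a SIEM makes. All fields and thresholds are hypothetical.
def risk_score(event):
    score = 0
    if event.get("failed_logins", 0) >= 5:
        score += 2          # possible brute-force indicator
    if event.get("hour") is not None and not (8 <= event["hour"] <= 18):
        score += 1          # activity outside business hours
    if event.get("geo") not in {"DE", "FR", "UK"}:
        score += 1          # unusual source country
    return score

def is_risky(event, threshold=2):
    return risk_score(event) >= threshold

event = {"failed_logins": 7, "hour": 3, "geo": "XX"}
print(risk_score(event), is_risky(event))  # 4 True
```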
Apache Spark at Apple with Sam Maclennan and Vishwanath Lakkundi (Databricks)
At Apple we rely on processing large datasets to power key components of Apple's largest production services. Spark is continuing to replace and augment traditional MR workloads with its speed and low barrier to entry. Our current analytics infrastructure consists of over an exabyte of storage and close to a million cores. Our footprint is also growing with the addition of new elastic services for streaming, ad hoc and interactive analytics.
In this talk we will cover the challenges of working at scale with tricks and lessons learned managing large multi-tenant clusters. We will also discuss designing and building a self-service elastic analytics platform on Mesos.
Become an IAM Policy Master in 60 Minutes or Less (SEC316-R1) - AWS re:Invent (Amazon Web Services)
Are you interested in becoming an IAM policy master and learning about powerful techniques for controlling access to AWS resources? If your answer is “yes,” this session is for you. Join us as we cover the different types of policies and describe how they work together to control access to resources in your account and across your AWS organization. We walk through use cases that help you delegate permission management to developers by demonstrating IAM permission boundaries. We take an in-depth look at controlling access to specific AWS regions using condition keys. Finally, we explain how to use tags to scale permissions management in your account. This session requires you to know the basics of IAM policies.
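The region-restriction technique mentioned above (condition keys) typically looks like the following policy sketch. The approved regions are examples only; global services such as IAM and STS are excluded via NotAction because their API calls are not region-scoped:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideApprovedRegions",
      "Effect": "Deny",
      "NotAction": ["iam:*", "organizations:*", "sts:*"],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["eu-west-1", "eu-central-1"]
        }
      }
    }
  ]
}
```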
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies? (Kai Wähner)
The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems.
Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for the wrong use cases by vendors. Let’s explore this dilemma in a presentation.
The slides cover technologies such as Apache Kafka, Apache Spark, Confluent, Databricks, Snowflake, Elasticsearch, AWS Redshift, GCP with Google BigQuery, and Azure Synapse.
Today organizations find themselves in a data rich world with a growing need for increased agility and accessibility of all this data for analysis and deriving keen insights to drive strategic decisions. Creating a data lake helps you to manage all the disparate sources of data you are collecting (in its original format) and extract value. In this session, learn how to architect and implement a data lake in the AWS Cloud. Learn about best practices as we walk through architectural blueprints.
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AWS re:Invent (Amazon Web Services)
In this session, we discuss architectural principles that help simplify big data analytics.
We'll apply these principles to various stages of big data processing: collect, store, process, analyze, and visualize. We'll discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on.
Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
Amazon EMR enables fast processing of large structured or unstructured datasets, and in this presentation we'll show you how to set up an Amazon EMR job flow to analyse application logs and perform Hive queries against them. We also review best practices around data file organisation on Amazon Simple Storage Service (S3), how clusters can be started from the AWS web console and command line, and how to monitor the status of a Map/Reduce job.
Finally we take a look at Hadoop ecosystem tools you can use with Amazon EMR and the additional features of the service.
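The Hive-on-EMR workflow described above usually starts by defining an external table over the logs in S3. A minimal HiveQL sketch, in which the bucket name, delimiter and columns are hypothetical:

```sql
-- Hypothetical external table over application logs stored in S3.
CREATE EXTERNAL TABLE app_logs (
  ts      STRING,
  level   STRING,
  message STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-example-bucket/logs/';

-- Count log lines per severity level.
SELECT level, COUNT(*) AS entries
FROM app_logs
GROUP BY level;
```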
See a recording of the webinar based on this presentation on YouTube here:
Check out the rest of the Masterclass webinars for 2015 here: http://aws.amazon.com/campaigns/emea/masterclass/
See the Journey Through the Cloud webinar series here: http://aws.amazon.com/campaigns/emea/journey/
The session will be a deep-dive introduction to Snowflake, covering Snowflake architecture, virtual warehouses, designing a real use case, and loading data into Snowflake from a data lake.
Apache Spark on K8S Best Practice and Performance in the Cloud (Databricks)
As of Spark 2.3, Spark can run on clusters managed by Kubernetes. We will describe best practices for running Spark SQL on Kubernetes on Tencent Cloud, including how to deploy Kubernetes on a public cloud platform to maximize resource utilization, and how to tune Spark configurations to take advantage of the Kubernetes resource manager and achieve the best performance. To evaluate performance, the TPC-DS benchmarking tool will be used to analyze the performance impact of queries across configuration sets.
Speakers: Junjie Chen, Junping Du
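The Spark-on-Kubernetes setup discussed above is driven through spark-submit configuration. A sketch of the relevant flags, where the API server address, container image, jar path and executor sizing are placeholders to be tuned per cluster:

```shell
# Placeholder endpoint, image and application jar; tune executor
# counts and memory for your cluster as discussed in the talk.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-sql-tpcds \
  --conf spark.kubernetes.container.image=<registry>/spark:2.3.0 \
  --conf spark.executor.instances=10 \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=2 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```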
Our cofounder Alex Dean gave an introduction to Snowplow and then talked about our roadmap for 2017. Alex touched on several topics including support for more clouds, support for more storage targets, tailoring Snowplow to your industry, more intelligent event sources, moving our batch pipeline to Spark, mega-scale Snowplow and real-time support for Sauna, our decisioning and response system. Presented on 5 April 2017.
Show various use cases and scenarios for Hadoop (tooling) on the cloud and modern data architectures.
•New insights into Analytics and Visualization, to impact the business bottom line.
•Tooling and insights provided by non-traditional approaches to data
•Example: a 360-degree view of the customer,
•Sentiment analysis with social media such as Twitter, traffic patterns, etc.
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15 (MLconf)
Sparking Data in the Cloud: Data isn’t useful until it’s used to drive decision-making. Companies, like Pinterest, are using Machine Learning to build data-driven recommendation engines and perform advanced cluster analysis. In this talk, Praveen Seluka will cover best practices for running Spark in the cloud, common challenges in iterative design and interactive analysis.
Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics (Lillian Pierson)
In this one-hour webinar, you will be introduced to Spark, the data engineering that supports it, and the data science advances it has spurred. You’ll discover the interesting story of its academic origins and then get an overview of the organizations who are using the technology. After being briefed on some impressive Spark case studies, you’ll come to know the next-generation Spark 2.0 (to be released in just a few months). We will also tell you about the tremendous impact that learning Spark can have upon your current salary, and the best ways to get trained in this ground-breaking new technology.
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ... (Lucas Jellema)
Data Science, Business Intelligence, Data Lake, Machine Learning and AI. Diverse terminology with a common goal: leverage data to realize business value. Through consolidated insight and automated processing, predictions, recommendations and actions. Using visualizations, dashboards, reports, alerts, machine learning models. Based on data. Data retrieved from raw sources into a data lake, wrangled into cleansed, enriched, anonymized and aggregated data sets and turned into business intelligence or used for training machine learning models, that in turn power Smart Applications. This session walks the audience through the start to end data flow on Oracle Autonomous Data Warehouse, Analytics Cloud, Big Data Cloud & Data Integration Platform.
Customer Feedback Analytics for Starbucks (Nishant Gandhi)
Assignment work for Northeastern University class 7250, Big Data Architecture and Governance.
A big data project proposal based on a Starbucks case study.
Enrich a 360-degree Customer View with Splunk and Apache Hadoop (Hortonworks)
What if your organization could obtain a 360-degree view of the customer across offline, online, social and mobile channels? Attend this webinar with Splunk and Hortonworks and see examples of how marketing, business and operations analysts can reach across disparate data sets in Hadoop to spot new opportunities for up-sell and cross-sell. We'll also cover examples of how to measure buyer sentiment and changes in buyer behavior, along with best practices on how to use data in Hadoop with Splunk to assign customer influence scores that online, call-center, and retail branches can use to customize more compelling products and promotions.
Big Data at the Speed of Business: Lessons Learned from Leading at the Edge (DataWorks Summit)
How do you make big data accessible, usable and valuable for everyone? And mine your data for intelligence in minutes and hours, not weeks and months? What about getting real-time insights from your data, even before you persist and replicate it? In this talk, we’ll examine compelling, real-world examples that offer a blueprint for integrating big data technologies (Splunk, Hadoop, RDBMS, Cassandra, HBase), delivering rapid visibility and insights to IT professionals, data analysts and business users, and that accelerate the adoption of big data in the enterprise.
Open Blueprint for Real-Time Analytics with In-Stream Processing (Grid Dynamics)
As companies continue to invest in big data, their focus is shifting from predictive analytics for reporting and business dashboards to machine learning & AI for real-time intelligent decision-making embedded in software. Many organizations are testing, exploring and piloting applications that automatically promote trending products, adjust prices or respond to alerts raised by intelligent real-time systems.
In her talk, Ms. Victoria Livschitz, founder and CTO of Grid Dynamics, will discuss common business drivers of real-time analytics applications and the emerging platforms for building such applications.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method, where all vertices are processed in each iteration. It comes, however, with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
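The standard (Monolithic) PageRank iteration that Levelwise PageRank is compared against can be sketched in a few lines. This is plain Python on a toy graph, using the customary damping factor; it assumes no dead ends, the same precondition the report states for Levelwise PageRank:

```python
# Monolithic power-iteration PageRank on a toy graph, for reference.
# Every vertex must have at least one out-edge (no dead ends).
def pagerank(out_edges, damping=0.85, tol=1e-10, max_iter=1000):
    n = len(out_edges)
    ranks = {v: 1.0 / n for v in out_edges}
    for _ in range(max_iter):
        # Teleport term, then distribute each vertex's rank to its targets.
        new = {v: (1.0 - damping) / n for v in out_edges}
        for v, targets in out_edges.items():
            share = damping * ranks[v] / len(targets)
            for t in targets:
                new[t] += share
        if sum(abs(new[v] - ranks[v]) for v in out_edges) < tol:
            return new
        ranks = new
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))
```

In the levelwise variant, this same update is applied one strongly-connected-component level at a time in topological order, so ranks of earlier levels are final before later levels are processed.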
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Clickstream & Social Media Analysis using Apache Spark
1. Clickstream & Social Media Analysis
Use cases and examples using Apache Spark
Michael Cutler @ TUMRA – November 2014
2. Hello
About Me
• Early adopter of Hadoop
• Spoke at Hadoop World on machine learning
• Twitter: @cotdp
TUMRA
We use Data Science and Big Data technology to help ecommerce companies understand their customers and increase sales.
This Talk
• Slides are on Slideshare
• Code example on Github
• Twitter: @tumra
5. Clickstream & Social Media Analysis
Generalised Approach
[Diagram: activity from the Web Site, Mobile/Tablet App, Social Network and People flows through Data Collection (Events), Data Processing (Files) and Reporting & Analysis (Tables), ending with You]
6. How has this approach evolved?
Rapidly reducing the ‘time to insight’
pre-Historic:
• Proprietary & Expensive
• Slow & Constrained
• Time to Insight: 48+ hours
2008 - Hadoop:
• Open-source & Inexpensive
• Flexible but complex to use
• Time to Insight: hours
2014 - Spark:
• Batch, Streaming & Interactive
• Fast & Easy to use
• Time to Insight: minutes
7. Weaving a story from a string of activities
Understanding the shopper’s journey
Day #0: arrived via a PPC long-tail keyword
Day #7: arrived via a PPC brand keyword & signed up to the email newsletter
Day #10: opened the email newsletter on an iPad
Day #13: returned via a PPC brand keyword
Day #17: Add To Cart, Order Placed
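A journey like the one above is stitched together from raw, unordered touchpoint events. A minimal pure-Scala sketch of that stitching (the case class, field names and sample data are illustrative, not from the talk):

```scala
// Stitching a shopper's journey: order raw touchpoints per person by time.
// All field names and sample data are illustrative.
case class Touch(personId: String, day: Int, channel: String)

// Raw events arrive unordered, possibly from many sources
val events = List(
  Touch("person-1234", 17, "order-placed"),
  Touch("person-1234", 0,  "ppc-long-tail"),
  Touch("person-1234", 10, "email-open"),
  Touch("person-1234", 7,  "email-signup")
)

// Group by person, then sort each person's touches chronologically
val journeys: Map[String, List[Touch]] =
  events.groupBy(_.personId).map { case (p, ts) => p -> ts.sortBy(_.day) }

val journey = journeys("person-1234").map(_.channel)
val firstTouch = journey.head   // "ppc-long-tail"
val lastTouch  = journey.last   // "order-placed"
```

On Spark the same grouping would be a `groupByKey` over a keyed RDD; the point is that attribution needs the whole ordered journey, not isolated events.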
9. It’s all about People & Products
Not just boring log files!
Activity & Interactions
Turn low-level events like “Page Views” into something meaningful,
e.g. <Person1234> <viewed-a> <Product:Camera>, <bought-a> …
Gauging Interest
Measuring the degree of interest a Person has in a Product,
e.g. are 10 views of a certain Product a good or bad thing?
Affinities
Either inferred from other People’s activities, or from Product similarity
Properties
Both People and Products have properties,
e.g. <Person1234> <is:gender> <Female>
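The <subject> <predicate> <object> shape above maps directly onto a small data model. A pure-Scala sketch (the predicate and object names echo the slide; the interest metric is an illustrative assumption):

```scala
// Activity and properties as <subject> <predicate> <object> triples.
// Predicate/object names follow the slide's examples; data is illustrative.
case class Triple(subject: String, predicate: String, obj: String)

val facts = List(
  Triple("Person1234", "viewed-a",  "Product:Camera"),
  Triple("Person1234", "viewed-a",  "Product:Camera"),
  Triple("Person1234", "bought-a",  "Product:Camera"),
  Triple("Person1234", "is:gender", "Female")        // a property, not an activity
)

// A crude "gauging interest" metric: count interactions per product
val interest: Map[String, Int] =
  facts.filter(_.obj.startsWith("Product:"))
       .groupBy(_.obj)
       .map { case (product, ts) => product -> ts.size }

// interest("Product:Camera") == 3
```

A real system would weight predicates differently (a purchase signals more interest than a view); the uniform count here is only a starting point.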
10. People & Product Interactions
Source: Snowplow Analytics
e.g. “Michael” “bought a” “Americano” “Starbucks, Shoreditch”
11. That sounds like a Graph …
Use graphs to understand user intent
Interest Graph Visualisation
• Collect user activity data in real-time: not just clicks but mouse-overs, images, video, social.
• Algorithms identify the products, categories and brands a particular person is interested in.
• Cluster users into ‘neighborhoods’ to infer what to show to existing and future visitors.
This visualisation illustrates just 1% of 6 weeks of visitor activity data. Blue data points are People, orange data points are Products.
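One simple way to get the product-to-product edges such a graph needs is overlap of viewer sets. A toy pure-Scala sketch using Jaccard similarity (the data, the function name and the choice of Jaccard are all assumptions for illustration):

```scala
// Product affinity inferred from other people's activities:
// Jaccard similarity over the sets of people who viewed each product.
// All sample data is illustrative.
val viewers: Map[String, Set[String]] = Map(
  "Camera" -> Set("p1", "p2", "p3"),
  "Tripod" -> Set("p2", "p3", "p4"),
  "Socks"  -> Set("p5")
)

def affinity(a: String, b: String): Double = {
  val (va, vb) = (viewers(a), viewers(b))
  val union = (va union vb).size
  if (union == 0) 0.0 else (va intersect vb).size.toDouble / union
}

// Camera and Tripod share 2 of 4 viewers; Camera and Socks share none
```

At scale this pairwise computation is exactly the kind of job the talk runs on Spark (e.g. with GraphX or a self-join on an RDD of (person, product) pairs).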
13. Three reasons Apache Spark is awesome!
Apart from “no more Java Map/Reduce code!!!”
Fast
• In-memory Caching
• DAG execution optimisation
• Easy to use in Scala, Java, Python
Smart
• Machine Learning baked in
• Graph algorithms
• Interactive Shell
Flexible
• Query from Spark SQL
• Streaming
• Batch (file based)
15. Apache Spark
Coexists with your existing Hadoop Infrastructure
[Diagram: Spark runs alongside Map/Reduce on Yarn/Mesos, on top of the Hadoop Filesystem (HDFS), next to Apache ZooKeeper, Apache Hive etc.]
16. Apache Spark can …
Simple example of Spark SQL used from Scala
Source: Databricks
Go from a SQL query … to a trained machine learning model in three lines of code.
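The code screenshot is not reproduced in this transcript. The Databricks example it refers to looks roughly like this Spark 1.x-era sketch; it assumes a live `sqlContext` with `users` and `events` tables already registered, and every table and column name here is an illustrative assumption:

```scala
// Sketch: from a SQL query to a trained MLlib model (Spark 1.x, circa 2014).
// Assumes a running SQLContext; table/column names are illustrative.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val trainingData = sqlContext.sql(
    "SELECT e.action, u.age, u.income FROM users u JOIN events e ON u.userId = e.userId")
  .map(row => LabeledPoint(row.getInt(0).toDouble,
                           Vectors.dense(row.getDouble(1), row.getDouble(2))))

val model = LogisticRegressionWithSGD.train(trainingData, 100)  // 100 iterations
```

The query result is an RDD of rows, so it feeds straight into MLlib with one `map`; that composability is the point of the slide.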
18. Example Architecture
Coexists with your existing Hadoop Infrastructure
[Diagram: events flow in via Apache Kafka to Analytics Jobs; results are stored in the Hadoop Filesystem (HDFS) and a NoSQL Store (Cassandra), coordinated by Apache ZooKeeper, and served to a Reporting Dashboard]
19. Social Media Analysis
Converting a low-level event into a meaningful high-level interaction
• A user-interaction from the Facebook firehose, received as a real-time stream of JSON
• Streamed into Apache Kafka, also stored in SequenceFiles
• Modeled into a Scala case class:
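The case-class screenshot is missing from this transcript; a plausible shape for such an interaction event (every field name here is an assumption, not the schema from the talk):

```scala
// A Facebook-style user interaction modeled as a Scala case class.
// Field names are illustrative; the real schema is not shown in the deck.
case class Interaction(
  userId: String,    // who acted
  action: String,    // e.g. "like", "comment", "share"
  objectId: String,  // what was acted on
  timestamp: Long    // epoch millis
)

val event = Interaction("user-42", "like", "post-7", 1415750400000L)
// Case classes give free equality, pattern matching and easy serialisation
val isLike = event.action == "like"
```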
20. Example - Spark (Scala)
Using the Spark (Scala) interface to analyze the data
• Parse JSON
• Extract interesting attributes
• ‘Reduce by Key’ to sum the result
• Print results
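Since the code screenshot is not reproduced, here is a pure-Scala stand-in for the four steps, with comments noting the Spark RDD equivalent of each; the sample data is illustrative and a crude regex stands in for a real JSON parser (the original would use a JSON library):

```scala
// The slide's four steps on local collections; on Spark the same chain
// would start from sc.textFile(...) or a Kafka stream.
// Sample data is illustrative; a regex keeps this sketch dependency-free.
val rawJson = List(
  """{"action":"like","objectId":"post-7"}""",
  """{"action":"like","objectId":"post-9"}""",
  """{"action":"share","objectId":"post-7"}"""
)

// 1. Parse JSON + 2. extract the interesting attribute
val Action = """"action":"([^"]+)"""".r
val actions = rawJson.flatMap(s => Action.findFirstMatchIn(s).map(_.group(1)))
// Spark equivalent: rdd.flatMap(...)

// 3. 'Reduce by Key' to sum the result
val counts = actions.groupBy(identity).map { case (k, v) => k -> v.size }
// Spark equivalent: rdd.map(a => (a, 1)).reduceByKey(_ + _)

// 4. Print results
counts.foreach(println)   // e.g. (like,2) and (share,1), in some order
```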
21. Example - Spark SQL
Using the Spark SQL interface to analyze the data
• Parse JSON
• Extract interesting attributes, transform into Case Classes
• ‘Register as table’
• Execute SQL, print results
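Again the screenshot is absent; a sketch of those steps in the Spark SQL API of late 2014 (Spark 1.1). It assumes a live `SparkContext` named `sc`, JSON parsing is elided, and all names and sample data are illustrative:

```scala
// Spark SQL version of the same analysis, Spark 1.1-era API.
// Assumes a running SparkContext `sc`; names and data are illustrative.
import org.apache.spark.sql.SQLContext

case class Interaction(userId: String, action: String, objectId: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit: RDD[case class] -> SchemaRDD

// Parse JSON, transform into case classes (parsing elided for brevity)
val interactions = sc.parallelize(Seq(
  Interaction("user-42", "like", "post-7"),
  Interaction("user-43", "like", "post-7")))

// 'Register as table', execute SQL, print results
interactions.registerTempTable("interactions")
sqlContext.sql("SELECT action, COUNT(*) AS n FROM interactions GROUP BY action")
  .collect().foreach(println)
```

The appeal over the raw-RDD version is that the same registered table can then be queried ad hoc from the shell or a JDBC client.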
22. Want to play with awesome tech and data?
We’re hiring! team@tumra.com
Data Engineer
Scala, functional programming, Hadoop, NoSQL
Sales & Marketing
Experience with SaaS and ecommerce sales