Improving Mobile Payments With Real-time Spark - datamantra
A talk about a real-world Spark Streaming implementation for improving the mobile payments experience. Presented at the Target data meetup in Bangalore by Madhukara Phatak on 22/08/2015.
Anatomy of Data Source API: A deep dive into the Spark Data Source API - datamantra
In this presentation, we discuss how to build a data source from scratch using the Spark Data Source API. All the code discussed in this presentation is available at https://github.com/phatak-dev/anatomy_of_spark_datasource_api
Anatomy of Data Frame API: A deep dive into the Spark DataFrame API - datamantra
In this presentation, we discuss the internals of the Spark DataFrame API. All the code discussed in this presentation is available at https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
Introduction to Structured Data Processing with Spark SQL - datamantra
An introduction to structured data processing using the Data Source and DataFrame APIs of Spark. Presented at the Bangalore Apache Spark Meetup by Madhukara Phatak on 31/05/2015.
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model - Garindra Prahandono
Sale Stock Engineering, represented by Garindra Prahandono, presents "High-Velocity GraphQL & Lambda-based Software Development Model" at the BandungJS event on May 14th, 2018.
Do you really know what the Cloud is?
Cloud computing is the next big thing in the IT industry: Amazon AWS, Google Compute Engine, Rackspace Cloud, and Windows Azure are fighting to win the battle of the future.
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa... - Big Data Spain
Session presented at the Big Data Spain 2012 Conference, 16th Nov 2012, ETSI Telecomunicacion UPM, Madrid.
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/coordinating-many-tools-of-big-data/alan-gates
How to Create 80% of a Big Data Pilot Project - Greg Makowski
When evaluating open source software, or other software of a certain size or complexity, organizations frequently want to conduct a pilot project, or proof of concept (POC). This talk describes a process for shortening the pilot by reusing configurations from performance testing as the POC's starting configurations.
The two faces of Google Scholar
Opening the academic Pandora’s Box
Why do we call it a big data bibliometric tool?
Drawbacks of Google Scholar, Google Scholar Citations, and Google Scholar Metrics
Scientists, developers, and many other technologists from many different industries are taking advantage of Amazon Web Services to meet the challenges of the increasing volume, variety, and velocity of digital information. Amazon Web Services offers an end-to-end portfolio of cloud computing resources to help you manage big data by reducing costs, gaining a competitive advantage and increasing the speed of innovation.
In this presentation from a webinar focusing on running Data Analytics on AWS, AWS Technical Evangelist Ian Massingham discusses the role that AWS services can play in helping you to derive value from your data. Topics include stream processing with Amazon Kinesis, processing data with Amazon Elastic MapReduce (EMR) and its ecosystem of tools, and running large scale data warehouses on AWS with Redshift.
Topics covered in this session:
• Discover how AWS customers are extracting value from Big Data
• Understand the role that AWS services could play in helping you to manage your data
• Learn about running Hadoop on AWS with Amazon EMR and its ecosystem of tools for data processing and analysis
See a recording of this webinar on YouTube here: http://youtu.be/ueRarqsCbJM
See past and future webinars in the Journey Through the Cloud series here: http://aws.amazon.com/campaigns/emea/journey/
For a deep dive into specific AWS services, you might also be interested in the Masterclass webinar series, which you can find here: http://aws.amazon.com/campaigns/emea/masterclass/
Collaborated with team members to design and build a restaurant management website that lists all nearby restaurants and provides restaurant details, menus, and food-ordering functionality
Implemented admin-side functionality for adding restaurants and menus to the website, including the authority to approve or block restaurants as required
Built restaurant review and rating features, with Google Maps for restaurant locations, using PHP, HTML, CSS3, JS, MySQL, and the Bootstrap framework
Followed agile methodology by dividing the project into small releases
Released each user story after successful testing, maintaining the quality of the project
Introduction to Hive and HCatalog presentation by Mark Grover at NYC HUG. A video of this presentation is available at https://www.youtube.com/watch?v=JGwhfr4qw5s
Big Data and the Climate/Environment domain (vis-a-vis the respective H2020 Societal Challenge) - Opportunities, Challenges and Requirements. As presented and discussed in the public launch of the BigDataEurope project.
Performed analysis on temperature, wind speed, humidity, and pressure datasets and implemented decision tree and clustering models to predict the possibility of rain
Created graphs and plots using algorithms such as k-nearest neighbors, naïve Bayes, decision trees, and k-means clustering
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily... - Jeffrey Breen
Part 3 of 3 of series focusing on the infrastructure aspect of getting started with Big Data. This presentation demonstrates how to use Apache Whirr to launch a Hadoop cluster on Amazon EC2--easily.
Presented at the Boston Predictive Analytics Big Data Workshop, March 10, 2012. Sample code and configuration files are available on github.
Stitch Fix aspires to help you find the style that you will love. Data, the backbone of the business, is used to help with styling recommendations, demand modeling, user acquisition, and merchandise planning and also to influence business decisions throughout the organization. These decisions are backed by algorithms and data collected and interpreted based on client preferences. This talk offers an overview of the compute infrastructure used by the data science team at Stitch Fix, covering the architecture, tools within the larger ecosystem, and the challenges that the team overcame along the way.
Apache Spark plays an important role in Stitch Fix's data platform, and the company's data scientists use Spark for their ETL and Presto for their ad hoc queries. The goal for the team running the compute infrastructure is to understand and make the data scientists' lives easier, particularly in terms of the usability of Spark, by building tools that make it easier to get started with Spark and transition to a daily workflow. The compute infrastructure is the part of the data platform that is responsible for all the needs of data scientists at Stitch Fix.
In this talk, we look at Stitch Fix's journey, exploring its Spark setup, in-house tools, and how they work in synergy with open source frameworks in a cloud environment. There are additional improvements to the infrastructure that help persist information for future use and optimization, and we look at how implementing Amazon's EMRFS has made it easier for us to read from the S3 source.
Totango is an Analytics platform for Customer Success.
Our data pipeline converts usage information into actionable analytics. The pipeline is managed using the Luigi workflow engine, and data transformations are done in Spark.
Modern ETL Pipelines with Change Data Capture - Databricks
In this talk we'll present how at GetYourGuide we've built, from scratch, a completely new ETL pipeline using Debezium, Kafka, Spark and Airflow, which can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.
This is usually done through either full or partial copies of the data with tools such as Sqoop. However, another approach that has become quite popular lately is to use Debezium as the Change Data Capture layer, which reads database binlogs and streams these changes directly to Kafka. As having data once a day is not enough anymore for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL using Debezium.
We'll walk the audience through the steps we followed to architect and develop such a solution using Databricks to reduce operation time. By building this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data and protecting our nights of sleep.
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r... - Omid Vahdaty
AWS Big Data Demystified is all about knowledge sharing, because knowledge should be given for free. In this lecture we discuss the advantages of working with Zeppelin + Spark SQL, JDBC + Thrift, Ganglia, R + SparkR + Livy, and a little bit about Ganglia on EMR.
Subscribe to our YouTube channel to see the video of this lecture:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
Data Science Salon: A Journey of Deploying a Data Science Engine to Production - Formulatedby
Presented by Mostafa Madjipour, Senior Data Scientist at Time Inc.
Next DSS NYC Event 👉 https://datascience.salon/newyork/
Next DSS LA Event 👉 https://datascience.salon/la/
Reducing the gap between R&D and production is still a challenge for data science/machine learning engineering groups in many companies. Typically, data scientists develop data-driven models in a research-oriented programming environment (such as R or Python). Next, the data/machine learning engineers rewrite the code (typically in another programming language) in a way that is easy to integrate with production services.
This process has some disadvantages: 1) it is time-consuming; 2) it slows the data science team's impact on the business; 3) code rewriting is prone to errors.
A possible solution to overcome the aforementioned disadvantages is a deployment strategy that easily embeds/transforms the model created by data scientists. Packages such as jPMML, MLeap, PFA, and PMML, among others, have been developed for this purpose.
In this talk we review some of the mentioned packages, motivated by a project at Time Inc. The project involves development of a near real-time recommender system, which includes a predictor engine, paired with a set of business rules.
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka - DataWorks Summit
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences.
To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals.
Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth.
We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs.
Topics include :
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* "Streaming" over Data Lake using Kafka
Introduction to Spark ML Pipelines Workshop - Holden Karau
Introduction to Spark ML Pipelines Workshop slides; companion Jupyter notebooks in Python & Scala are available from my GitHub at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... - Databricks
As Apache Spark applications move to a containerized environment, there are many questions about how best to configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized Spark environment.
In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
In Cassandra Lunch #88, Rahul Singh, CEO of Anant, discusses how Cadence works on top of Cassandra to provide workflow management at scale, and covers the Cadence architecture in the context of SAGA patterns.
Accompanying Blog: Coming Soon!
Accompanying YouTube: https://youtu.be/YPPPM0F0xw0
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Cassandra Lunch Weekly at 12 PM EST Every Wednesday: https://www.meetup.com/Cassandra-DataStax-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Cassandra.Lunch:
https://github.com/Anant/Cassandra.Lunch
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
Similar to A Tool For Big Data Analysis using Apache Spark (20)
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working with unstructured data. Speakers present related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Learn SQL from Basic Queries to Advanced Queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder.
The Building Blocks of QuestDB, a Time Series Database - javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
2. ● Ganesha Yadiyala
● Big data consultant at datamantra.io
● Consults in Spark and Scala
● ganeshayadiyala@gmail.com
3. Agenda
● Problem Statement
● Business view
● Why Spark
● Thinking REST
● Load API
● Transform API
● Machine learning
● Pipeline API
● Save API
4. Problem Statement
Build a generic solution which can be used to transform data and then analyse it to get useful results out of it.
5. Business view
● This is the era of big data.
● All companies are trying to get something useful from their data and solve problems.
● Many big data frameworks exist, but we need a tool which leverages most of them and can solve problems easily.
● So a general solution or tool which can solve many of these problems would be a big plus.
6. Why we used Spark
There are many big data frameworks out there which can be used for analysing data, but we chose Spark because of its:
● Capability to handle multiple data sources
● Easy binding with external data
● Good support for machine learning through Spark ML and Spark MLlib
7. Thinking REST
To expose all this transformation and analysis we provided a REST API because it:
● Minimises the coupling between client and server
● Lets different clients interact with the tool
We used Akka HTTP for the REST service.
8. Akka-http
It is an actor-based toolkit for interacting with web services and clients.
● It is also written in Scala, and it uses the same configuration management library (Typesafe Config) as Spark
● It is actor- and future-based
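A minimal sketch of what such an Akka HTTP endpoint could look like. The `/load` route and its `format`/`path` parameters are hypothetical; the slides do not show the tool's actual routes:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object RestServer extends App {
  implicit val system: ActorSystem = ActorSystem("rest-server")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // Hypothetical /load endpoint: in the real tool this would hand the
  // request off to the shared Spark context (see the next slide)
  val route =
    path("load") {
      post {
        parameters("format", "path") { (format, path) =>
          complete(s"would load $path as $format")
        }
      }
    }

  Http().bindAndHandle(route, "localhost", 8080)
}
```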
10. REST server design
● Instead of going with Spark JobServer, we went with our own REST server
● Once the REST server is started, a Spark context is created
● All the configuration is passed to the Spark context through Typesafe Config during its creation
● The same context is used for all operations.
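A sketch of the startup wiring this slide describes, using SparkSession as the DataFrame-era entry point. The config keys (`spark.master`, `spark.app-name`) are illustrative, not the tool's actual keys:

```scala
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

object SparkHolder {
  // Typesafe Config reads application.conf from the classpath
  private val config = ConfigFactory.load()

  // Created once when the REST server starts, then reused for every operation
  lazy val spark: SparkSession = SparkSession.builder()
    .master(config.getString("spark.master"))
    .appName(config.getString("spark.app-name"))
    .getOrCreate()
}
```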
11. Loading from different sources
We supported different types of data:
● CSV data source
● JSON data source
● Parquet data source
● XML data source
12. Loading from different sources
We also supported sources like:
● MongoDB
● Kafka
● JDBC
● Cassandra
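A hedged sketch of what a generic load API along these lines could look like, mapping the requested format onto Spark's DataFrameReader. The `load` helper and its signature are illustrative; XML, MongoDB and Cassandra additionally require their connector packages on the classpath:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

def load(spark: SparkSession,
         format: String,                  // "csv", "json", "parquet", "jdbc", ...
         options: Map[String, String],
         path: Option[String]): DataFrame = {
  val reader = spark.read.format(format).options(options)
  path match {
    case Some(p) => reader.load(p)  // file-based sources: csv, json, parquet, xml
    case None    => reader.load()   // connector sources configured purely via options
  }
}

// e.g. load(spark, "csv", Map("header" -> "true"), Some("/data/input.csv"))
```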
13. Transformation
In the big data world, the data coming into the system cannot be used as it is; we may have to transform it as needed for the operation.
We exposed REST APIs to do these transformations, which internally call Spark DataFrame APIs (see the sketch after the examples on the next slide).
14. Example
Some of the transformations we provided are:
● Cast - cast the datatype of a column
● Filter - filter based on a formula or condition
● Aggregation - max, min, sum, median, etc.
● Joins - joining two datasets
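A sketch of the DataFrame calls such transformation requests could translate to; the column names and sample data are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("transform-demo").getOrCreate()
import spark.implicits._

// Illustrative input; in the tool the data would come from the Load API
val df = Seq(("books", "12.5"), ("games", "150.0"), ("games", "220.0"))
  .toDF("category", "price")

val cast       = df.withColumn("price", col("price").cast("double"))  // Cast
val filtered   = cast.filter(col("price") > 100)                      // Filter
val aggregated = filtered.groupBy("category")
  .agg(max("price"), min("price"), sum("price"))                      // Aggregation
// A join would look like: aggregated.join(otherDf, Seq("category"), "inner")
```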
15. Machine learning - Spark ML
Spark ML provides a higher-level API which is built on top of the DataFrame.
● We did not use MLlib because it is built on top of the RDD.
● We provided REST APIs which talk to these ML APIs
16. Example
Some of the ML APIs we provided are:
● Linear regression
● Decision tree (regressor and classifier)
● Ridge regression
● KMeans, etc.
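A sketch of the spark.ml calls behind a "linear regression" request; the data and column names are illustrative. Note that spark.ml expects the features assembled into a single vector column first:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ml-demo").getOrCreate()
import spark.implicits._

val data = Seq((1.0, 2.0, 5.0), (2.0, 3.0, 8.0), (3.0, 5.0, 13.0))
  .toDF("x1", "x2", "label")

// Assemble the feature columns into the single vector column spark.ml expects
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")

val model = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .fit(assembler.transform(data))

// Ridge regression is LinearRegression with setRegParam(...) and setElasticNetParam(0.0)
println(model.coefficients)
```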
17. Challenges in Spark ML
● It was very difficult to write a generic API because not all ML algorithms expect similar inputs
● Not all the APIs are documented properly
● Validating the types of the columns which can be given to these APIs is really difficult.
18. Save API
Once the transformation is done, or ML produces its output, the user may want to save the result. We support:
● text
● json
● parquet
● mongodb
● cassandra, etc.
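A sketch of a generic save API mirroring the load side via Spark's DataFrameWriter; the `save` helper is illustrative, and MongoDB and Cassandra again require their connectors:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

def save(df: DataFrame,
         format: String,                 // "text", "json", "parquet", ...
         options: Map[String, String],
         path: Option[String]): Unit = {
  val writer = df.write.format(format).options(options).mode(SaveMode.Overwrite)
  path match {
    case Some(p) => writer.save(p)  // file-based sinks
    case None    => writer.save()   // connector sinks configured via options
  }
}

// e.g. save(result, "parquet", Map.empty, Some("/data/output.parquet"))
```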
19. Pipeline and scheduling
We also implemented a pipeline API which pipes together all the loading, transformation, and ML APIs.
If the user wants to run these operations at a scheduled time, that is possible through the schedule API which we have provided.
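One possible shape for such a pipeline, folding each step over the loaded DataFrame before handing the result to a sink. The `Pipeline` object and `Step` alias are illustrative, not the tool's actual API:

```scala
import org.apache.spark.sql.DataFrame

object Pipeline {
  type Step = DataFrame => DataFrame

  // Apply every transformation step in order, then hand the result to the sink
  def run(source: DataFrame, steps: Seq[Step])(sink: DataFrame => Unit): Unit =
    sink(steps.foldLeft(source)((df, step) => step(df)))
}

// e.g. Pipeline.run(loaded, Seq(_.filter("price > 100"), _.dropDuplicates()))(
//        df => save(df, "parquet", Map.empty, Some("/data/out")))
```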
20. Summary
No solution will be able to solve all big data problems, but we tried to build a tool which is generic enough to let you write your own transformations on data and analyse it, and with it we can solve many of these problems.