Internship Report
On
“Big Data and Cloud Computing”
Submitted By
Aditya D. Shinde T190304350
Under the guidance
of
Prof. Pradnya Kothawade
DEPARTMENT OF COMPUTER ENGINEERING
GENBA SOPANRAO MOZE COLLEGE OF ENGINEERING,
BALEWADI, PUNE-411041
SAVITRIBAI PHULE PUNE UNIVERSITY
[Year: 2023-2024]
CERTIFICATE
This is to certify that the “Internship Report” on Big Data and Cloud Computing, submitted by ADITYA D. SHINDE, is his own work, carried out at YBI Foundation and submitted during the 2023-2024 academic year toward the degree of BACHELOR OF ENGINEERING IN COMPUTER ENGINEERING.
(Prof. Pradnya Kothawade) (Prof. Rahul Kumar)
Guide Head,
Department of Computer Engineering Department of Computer Engineering
(Dr. Ratnaraj Kumar Jambi)
Principal,
Genba Sopanrao Moze College of Engineering, Pune – 45
Place : Pune
Date : 30/04/2024
ACKNOWLEDGEMENT
With immense pleasure, we present the Internship Report as part of the curriculum of T.E. Computer Engineering. We wish to thank all the people who gave us unending support from the beginning of the internship.
We express our sincere and profound thanks to Prof. Pradnya Kothawade, our Internship Guide, and our HOD, Prof. Rahul Kumar, who always stood by us as a helping and guiding support, and to all those who have directly or indirectly guided and helped us in the preparation of this report.
Aditya D. Shinde
TABLE OF CONTENTS
S No. Content
1.3 Abstract
1.4 Internship Place Details
1.5 Certificate
2.1 Introduction To Internship
2.2 Mode of Internship
2.3 Domain of Internship
2.4 Objectives of Internship
2.5 Motivation/Scope of Internship
2.6 Methodological Details
2.7 Research Challenges
2.8 Conclusion
2.9 References
Abstract
Big Data is used in decision-making processes to gain useful insights hidden in the data for business and engineering. At the same time, it presents challenges in processing; cloud computing has helped advance big data by providing computational, networking, and storage capacity. This report presents a review of the opportunities and challenges of transforming big data using cloud computing resources.
INTERNSHIP PLACE DETAILS
Company background: organization and activities
Name of Company: YBI Foundation
Contact number of Company: (+91) 9667987711
Name of Director: Alok Yadav
Email ID of Supervisor: alokyadav@yantrabyte.com
Company Background:
YBI Foundation is a non-profit organization dedicated to empowering youth through education,
entrepreneurship, and community development initiatives. Founded in [Year], our mission is to foster
a generation of young leaders equipped with the skills, resources, and support networks necessary to
thrive in an ever-changing world.
Our vision is to create a world where every young person has access to quality education, economic
opportunities, and a supportive community that nurtures their growth and potential.
Looking ahead, YBI Foundation remains committed to expanding our reach and deepening our
impact. Our future directions include:
• Scaling our education and entrepreneurship programs to reach more youth in underserved
communities.
• Strengthening partnerships with government agencies, corporations, and civil society
organizations to leverage resources and expertise for greater collective impact.
• Embracing technology and digital innovation to enhance the delivery and effectiveness of our
programs, particularly in response to the evolving needs of youth in a rapidly changing world.
2.1 Introduction To Internship
The volume of data and information captured by organizations from various mobile devices and multimedia is increasing every moment and has almost doubled every year. This sheer volume of generated data can be categorized as structured or unstructured data that cannot be easily loaded into regular relational databases. Big data requires pre-processing to convert the raw data into a clean data set that is feasible for analysis. Healthcare, finance, engineering, e-commerce, and various scientific fields use these data for analysis and decision making. Advancements in data science, data storage, and cloud computing have allowed for the storage and mining of big data.
Cloud computing has resulted in increased parallel processing, scalability, accessibility, data security, virtualization of resources, and integration with data stores. Cloud computing has eliminated the upfront infrastructure cost of investing in hardware, facilities, utilities, or building large data centres. Cloud infrastructure scales on demand to support fluctuating workloads, which has enabled the scalability of data produced and consumed by big data applications. Cloud virtualization can create virtual platforms of server operating systems and storage devices to spawn multiple machines at the same time. This provides a way to share resources and isolate hardware, increasing the accessibility, management, analysis, and computation of the data.
The main objective of this report is to provide a review of the opportunities and challenges of big data applications in cloud computing, which require data to be processed efficiently, and to present some good design principles.
2.2 Mode of Internship
Online Internship: Navigating the Virtual Workspace
In recent years, the rise of technology and connectivity has transformed the landscape of internship
opportunities, paving the way for online internships that transcend geographical boundaries and offer
flexibility in engagement. An online internship, also known as a virtual internship, allows interns to
work remotely from any location with internet access, leveraging digital tools and communication
platforms to collaborate with colleagues, complete projects, and gain valuable hands-on experience in
their chosen field. This section explores the intricacies of online internships, including their benefits,
challenges, and best practices for success.
Benefits of Online Internships:
• Flexibility: Online internships offer flexibility in terms of location and schedule, allowing
interns to work from the comfort of their homes or any other preferred workspace. This
flexibility enables interns to balance their internship commitments with academic studies,
personal responsibilities, or other part-time work.
• Access to Opportunities: Online internships provide access to a wide range of opportunities
across diverse industries and geographical regions, eliminating geographical barriers and
opening doors to internships with organizations located anywhere in the world. This expanded
access increases the likelihood of finding internships aligned with individual interests, skills,
and career goals.
• Cost-Effectiveness: By eliminating the need for commuting or relocating to a different city
for the duration of the internship, online internships can be more cost-effective for interns,
reducing expenses associated with transportation, accommodation, and daily living expenses.
• Enhanced Digital Skills: Engaging in an online internship exposes interns to a variety of
digital tools, collaboration platforms, and communication technologies commonly used in
remote work environments. This hands-on experience enhances interns' digital literacy,
adaptability, and proficiency in navigating virtual workspaces, which are increasingly valued
in today's digital economy.
• Global Networking Opportunities: Online internships enable interns to connect and
collaborate with professionals, mentors, and peers from diverse cultural backgrounds and
geographic locations. Building relationships with professionals across the globe expands
interns' professional networks, fosters cross-cultural understanding, and opens doors to future
career opportunities.
Challenges of Online Internships:
• Communication Barriers: In a virtual work environment, communication may be hindered
by factors such as time zone differences, language barriers, and reliance on digital
communication channels. Clear and effective communication becomes paramount to ensure
alignment, collaboration, and understanding among team members.
• Lack of In-Person Interaction: Unlike traditional in-person internships, online internships
lack face-to-face interaction, which may hinder relationship-building, mentorship, and
informal learning opportunities that often occur organically in a physical workplace.
Overcoming this challenge requires proactive efforts to foster virtual connections, build
rapport, and seek mentorship through digital channels.
• Technical Issues: Technical glitches, internet connectivity issues, and software compatibility
problems are common challenges faced in online internships, which can disrupt workflow,
delay project timelines, and hinder productivity. Interns must be prepared to troubleshoot
technical issues independently or seek timely assistance from technical support teams.
• Self-Discipline and Time Management: Working remotely requires a high degree of self-
discipline, organization, and time management skills to stay focused, meet deadlines, and
balance competing priorities effectively. Interns must proactively manage their time, set
realistic goals, and establish routines to maintain productivity and accountability in a virtual
work environment.
Best Practices for Success in Online Internships:
• Establish Clear Expectations: Clarify expectations, goals, and deliverables with supervisors
at the outset of the internship to ensure alignment and understanding of roles and
responsibilities.
• Communicate Proactively: Maintain regular communication with supervisors, colleagues,
and mentors through email, instant messaging, video conferencing, or project management
platforms to stay informed, seek clarification, and provide updates on progress.
• Embrace Digital Collaboration Tools: Familiarize yourself with digital collaboration tools
such as Slack, Microsoft Teams, Zoom, Trello, or Asana to facilitate communication, project
management, and collaboration with remote teams.
• Seek Feedback and Mentorship: Actively seek feedback from supervisors and mentors on
your work, progress, and areas for improvement. Establish regular check-ins or virtual
meetings to discuss goals, challenges, and opportunities for growth.
• Stay Organized and Productive: Create a dedicated workspace, establish a daily routine, and
prioritize tasks to maintain focus, productivity, and work-life balance while working remotely.
• Network and Build Relationships: Take advantage of virtual networking opportunities,
industry webinars, and online communities to expand your professional network, engage with
peers, and build meaningful relationships within your field of interest.
2.3 Domain of Internship
Domain of Internship: Big Data And Cloud Computing
BIG DATA
Data that is huge and difficult to store, manage, and analyze through traditional databases is termed “Big Data”. It requires a scalable architecture for its efficient storage, manipulation, and analysis. Such massive volumes of data come from myriad sources: smartphones and social media posts; sensors, such as traffic signals and utility meters; point-of-sale terminals; consumer wearables such as fitness meters; and electronic health records. Various technologies are integrated to discover the hidden value in these varied, complex data and transform it into actionable insight, improved decision making, and competitive advantage. The characteristics of big data are:
1.) Volume – Refers to the incredible amount of data generated each second from different sources such as social media, cell phones, cars, credit cards, M2M sensors, photographs, and videos, which allows users to mine the hidden information and patterns found in them.
2.) Velocity – Refers to the speed at which data is generated, transferred, collected, and analyzed. Data generated at an ever-accelerating pace must be analyzed, and the speed of transmission and access to the data must remain near-instantaneous to allow real-time access for the applications that depend on these data.
3.) Variety – Refers to data generated in different formats, either structured or unstructured. Structured data such as names, phone numbers, addresses, and financials can be organized within the columns of a database. This type of data is relatively easy to enter, store, query, and analyze. Unstructured data, which constitutes about 80% of today's data, is more difficult to sort and extract value from. Unstructured data includes text messages, audio, blogs, photos, video sequences, social media updates, log files, and machine and sensor data.
4.) Value – Refers to the hidden value discovered from the data for decision making. Substantial
value can be found in big data, including understanding your customers better, targeting them
accordingly, optimizing processes, and improving machine or business performance.
5.) Veracity – Refers to the quality and reliability of the data source. Its importance lies in the context and the meaning it adds to the analysis. Knowledge of the data's veracity in turn helps in better understanding the risks associated with analysis and business decisions based on the data set.
V’s of Big Data
CLOUD COMPUTING
Cloud computing delivers computing services such as servers, storage, databases, networking, software, analytics, and intelligence over the internet for faster innovation, flexible resources, heavy computation, parallel data processing, and economies of scale. It empowers organizations to concentrate on their core business by completely abstracting computation, storage, and network resources, allocating them to workloads as needed, and tapping into an abundance of prebuilt services. Figure 3 shows the differences between on-premise and cloud services: the services offered by each computing layer and the differences between them.
Figure 3: User Management vs. Cloud Management
The array of available cloud computing services is vast, but most fall into one of the
following categories:
SaaS: Software as a Service
Software as a Service represents the largest cloud market and the most commonly used business option among cloud services. SaaS delivers applications to users over the internet. Applications delivered through SaaS are maintained by third-party vendors, and their interfaces are accessed by clients through the browser. Since most SaaS applications run directly in a browser, there is no need for the client to download or install any software. In SaaS, the vendor manages applications, runtime, data, middleware, OS, virtualization, servers, storage, and networking, which makes it easy for enterprises to streamline their maintenance and support.
PaaS: Platform as a Service
The Platform as a Service model provides hardware and software tools over the internet that developers use to build customized applications. PaaS makes the development, testing, and deployment of applications quick, simple, and cost-effective. This model allows businesses to design and create applications that are integrated into PaaS software components, while enterprise operations or third-party providers manage the OS, virtualization, servers, storage, networking, and the PaaS software itself. These applications are scalable and highly available since they have cloud characteristics.
IaaS: Infrastructure as a Service
The Infrastructure as a Service cloud computing model provides a self-service platform for accessing, monitoring, and managing remote data center infrastructure, such as compute, storage, and networking services, delivered to organizations through virtualization technology. IaaS users are responsible for managing applications, data, runtime, middleware, and the OS, while providers manage virtualization, servers, hard drives, storage, and networking. IaaS provides the same capabilities as a data center without the need to maintain it physically [6]. Figure 4 represents the different cloud computing services offered.
Primary Cloud Computing Services
2.4 Objectives of Internship
Objectives of Data Science and Data Analytics Internship
Embarking on a data science and data analytics internship presents a unique opportunity for
individuals to gain practical experience, develop technical skills, and apply theoretical knowledge to
real-world data challenges. The objectives of such an internship are multifaceted, encompassing
professional development, hands-on learning, and contributions to organizational goals. This section
delineates the key objectives of a data science and data analytics internship, elucidating their
significance and impact on the intern's growth and learning journey.
• Hands-on Experience: The primary objective of a data science and data analytics internship
is to provide interns with hands-on experience in working with real-world data sets, tools, and
methodologies. By engaging in data collection, cleaning, analysis, and visualization tasks,
interns gain practical exposure to the entire data lifecycle, honing their technical skills and
problem-solving abilities in a professional setting.
• Application of Theoretical Knowledge: Internships offer a platform for interns to apply
theoretical concepts and techniques learned in academic courses to practical data challenges.
By working on projects that require data manipulation, statistical analysis, and machine
learning model development, interns reinforce their understanding of core data science and
data analytics principles and gain insights into their real-world applications.
• Skill Development: Internships serve as a catalyst for skill development, enabling interns to
enhance their proficiency in programming languages such as Python, R, or SQL, as well as
data analysis tools and libraries such as Pandas, NumPy, and TensorFlow. Through hands-on
projects and mentorship from experienced professionals, interns sharpen their data
manipulation, statistical analysis, and machine learning skills, preparing them for future roles
in the field.
• Exposure to Industry Practices: Internships provide interns with exposure to industry best
practices, tools, and workflows commonly used in data science and data analytics roles. By
working alongside seasoned professionals and collaborating on real-world projects, interns
gain insights into how data-driven decisions are made within organizations, learning about
data governance, project management, and ethical considerations in data science.
• Contribution to Organizational Goals: Interns play a valuable role in contributing to
organizational goals through their internship projects. Whether it's building predictive models
to improve business forecasting, analyzing customer behavior to optimize marketing
strategies, or automating data pipelines for enhanced efficiency, interns have the opportunity
to make tangible contributions that drive business outcomes and add value to the organization.
• Networking and Professional Development: Internships provide interns with networking
opportunities and exposure to industry professionals, mentors, and peers. By building
relationships, seeking mentorship, and participating in team collaborations, interns expand
their professional network, gain valuable insights into career pathways, and lay the foundation
for future opportunities in the field.
2.5 Motivation/Scope of Internship
Undertaking a data science and data analytics internship is driven by various motivations that align
with career aspirations, academic interests, and personal growth objectives. The scope of such an
internship encompasses a range of learning opportunities, project engagements, and contributions to
organizational goals. This section explores the motivation and scope of a data science and data
analytics internship, shedding light on the rationale, objectives, and potential impact of the internship
experience.
Motivation:
• Professional Growth: A primary motivation for pursuing a data science and data analytics
internship is to foster professional growth and development. Interns seek to gain hands-on
experience in working with real-world data sets, applying analytical techniques, and
leveraging data-driven insights to solve business problems. This experience enhances their
skills, expands their knowledge base, and prepares them for future roles in the field.
• Skill Enhancement: Internships offer opportunities for interns to enhance their technical
skills in areas such as programming languages (e.g., Python, R, SQL), statistical analysis,
machine learning algorithms, data visualization, and big data technologies. By engaging in
data manipulation, modeling, and interpretation tasks, interns strengthen their analytical
capabilities and proficiency in utilizing data science tools and techniques.
• Exploration of Career Pathways: Internships serve as a platform for interns to explore
different career pathways within the field of data science and data analytics. Whether it's
focusing on data engineering, machine learning, business intelligence, or data visualization,
interns have the opportunity to gain exposure to diverse roles, industries, and applications of
data science, helping them identify areas of interest and specialization.
• Networking and Mentorship: Internships provide avenues for interns to network with
professionals in the data science community and seek mentorship from experienced
practitioners. By connecting with industry professionals, attending networking events, and
participating in mentorship programs, interns can gain valuable insights, advice, and guidance
that can inform their career decisions and trajectory in the field.
• Contribution to Meaningful Projects: Interns are motivated by the opportunity to contribute
to meaningful projects that have real-world impact and add value to the organization. Whether
it's analyzing customer data to improve marketing strategies, building predictive models to
optimize business processes, or conducting exploratory analysis to uncover insights, interns
seek to make tangible contributions that drive business outcomes and innovation.
Scope:
• Learning Objectives: The scope of a data science and data analytics internship is defined by
its learning objectives, which outline the specific knowledge, skills, and competencies that
interns aim to acquire or develop during the internship period. These objectives may include
mastering data manipulation techniques, understanding machine learning algorithms, or
gaining proficiency in data visualization tools.
• Project Engagement: Internship projects within the scope of data science and data analytics
vary in complexity and focus, ranging from exploratory data analysis to predictive modeling
and data-driven decision-making. The scope of the internship project defines the tasks,
deliverables, and outcomes that interns are expected to achieve, guiding their activities and
responsibilities throughout the internship.
• Industry Focus: The scope of the internship may also be influenced by the industry or sector
in which the internship is situated. Whether it's technology, finance, healthcare, e-commerce,
or government, the industry focus of the internship determines the context, challenges, and
opportunities encountered during the internship journey, shaping the scope of projects and
applications of data science techniques.
• Data Sets and Tools: Internship scope includes working with diverse data sets, ranging from structured databases to unstructured text and image data. Interns engage with a variety of data science tools and technologies, including programming languages (e.g., Python, R), data analysis packages (e.g., Pandas, NumPy), machine learning libraries (e.g., scikit-learn, TensorFlow), and data visualization tools (e.g., Matplotlib, Tableau), depending on the scope of their projects and organizational requirements; a brief illustrative sketch follows this list.
• Collaboration and Communication: Interns collaborate with cross-functional teams,
stakeholders, and mentors to accomplish project goals and deliverables within the scope of the
internship. Effective communication, teamwork, and project management skills are essential
for navigating project scope, addressing challenges, and ensuring alignment with
organizational objectives.
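
As a small, hedged illustration of the tools named in the Data Sets and Tools item above, the following sketch loads a structured data set with Pandas and computes simple summaries. The file name and column names here are hypothetical, chosen only for illustration.

    import pandas as pd

    # Hypothetical structured data set; the path and columns are assumptions.
    df = pd.read_csv("customers.csv")            # e.g. columns: region, spend

    print(df.head())                             # first rows, for a quick look
    print(df["spend"].describe())                # count, mean, std, quartiles
    print(df.groupby("region")["spend"].mean())  # mean spend per region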
2.6 Methodological Details
Methodological Details on Python with Data Structures and Algorithms
Python, as a versatile programming language, is widely used for implementing data structures and
algorithms due to its simplicity, readability, and extensive library support. This section provides
methodological details on how Python is utilized for implementing data structures and algorithms,
including the choice of data structures, algorithm design principles, and practical considerations.
1. Choice of Data Structures:
Python offers built-in and library-supported data structures that are essential for implementing
algorithms efficiently. Commonly used data structures include:
• Lists: Dynamic arrays that can store elements of different data types.
• Tuples: Immutable sequences typically used for heterogeneous data.
• Sets: Unordered collections of unique elements for fast membership testing.
• Dictionaries: Key-value pairs for efficient data retrieval based on keys.
• Arrays: Homogeneous collections of elements with fixed size for numerical computations.
• Linked Lists: Linear data structures composed of nodes, each containing a data element and a
reference to the next node.
• Stacks and Queues: Abstract data types for Last-In-First-Out (LIFO) and First-In-First-Out
(FIFO) operations, respectively.
• Trees: Hierarchical data structures consisting of nodes connected by edges, with examples
including binary trees, binary search trees, and heaps.
• Graphs: Non-linear data structures composed of nodes and edges, representing relationships
between entities.
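
To make the list above concrete, here is a minimal sketch of these structures in plain Python; the variable names and values are invented for illustration.

    from collections import deque

    temperatures = [21.5, 22.0, 19.8]         # list: dynamic array
    point = (48.85, 2.35)                     # tuple: immutable sequence
    seen_ids = {"u1", "u2", "u1"}             # set: duplicates collapse to {"u1", "u2"}
    user = {"name": "Ada", "role": "intern"}  # dictionary: key-value lookup

    stack = ["task1", "task2"]                # stack via list (LIFO)
    top = stack.pop()                         # -> "task2"

    queue = deque(["job1", "job2"])           # queue via deque (FIFO)
    first = queue.popleft()                   # -> "job1"

    print(max(temperatures), point, sorted(seen_ids), user["name"], top, first)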
2. Algorithm Design Principles:
When implementing algorithms in Python, adhering to established design principles enhances code
readability, performance, and maintainability. Key algorithm design principles include:
• Divide and Conquer: Break down complex problems into smaller, more manageable
subproblems, solve them recursively, and combine their solutions.
• Dynamic Programming: Store solutions to overlapping subproblems in a table to avoid
redundant computations in recursive algorithms.
• Greedy Algorithms: Make locally optimal choices at each step with the hope of finding a
globally optimal solution.
• Backtracking: Systematically explore all possible solutions to a problem by recursively
building candidates and rejecting those that fail to satisfy constraints.
• Sorting and Searching: Utilize efficient sorting algorithms (e.g., Merge Sort, Quick Sort)
and searching techniques (e.g., Binary Search) for data manipulation and retrieval tasks.
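
As a concrete, self-contained sketch of two of these principles, the snippet below implements binary search (divide and conquer applied to searching a sorted list) and a memoized Fibonacci (dynamic programming); the function names are chosen for this sketch.

    from functools import lru_cache

    def binary_search(items, target):
        """Divide and conquer: halve the sorted search space each step, O(log n)."""
        lo, hi = 0, len(items) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if items[mid] == target:
                return mid
            if items[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1  # target not present

    @lru_cache(maxsize=None)
    def fib(n):
        """Dynamic programming via memoization: each subproblem is solved once."""
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    print(binary_search([1, 3, 5, 7, 9], 7))  # 3
    print(fib(30))                            # 832040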
3. Practical Considerations:
In addition to selecting appropriate data structures and algorithms, practical considerations ensure
efficient implementation and optimization of Python code:
• Time and Space Complexity Analysis: Analyze the time and space complexity of algorithms
to assess their efficiency and scalability, guiding the selection of optimal solutions for specific
use cases.
• Code Optimization: Optimize Python code for performance by minimizing redundant
computations, reducing memory overhead, and leveraging built-in functions and library
modules.
• Unit Testing: Validate the correctness and functionality of Python implementations through
unit testing, ensuring that algorithms produce expected outputs for various inputs and edge
cases.
• Error Handling: Implement robust error handling mechanisms to handle exceptions, edge
cases, and unexpected behaviors gracefully, enhancing the reliability and robustness of Python
code.
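
To illustrate the unit-testing point, here is a minimal sketch that checks a small, made-up helper function against typical, single-element, and empty inputs.

    import unittest

    def running_mean(values):
        """Return the cumulative mean after each element; [] for empty input."""
        means, total = [], 0.0
        for i, v in enumerate(values, start=1):
            total += v
            means.append(total / i)
        return means

    class TestRunningMean(unittest.TestCase):
        def test_typical(self):
            self.assertEqual(running_mean([2, 4, 6]), [2.0, 3.0, 4.0])

        def test_single_element(self):
            self.assertEqual(running_mean([5]), [5.0])

        def test_empty(self):  # edge case: no data at all
            self.assertEqual(running_mean([]), [])

    if __name__ == "__main__":
        unittest.main()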
4. Libraries and Frameworks:
Python offers a rich ecosystem of libraries and frameworks for data structures and algorithms,
including:
1. NumPy: For numerical computations and array manipulation.
2. Pandas: For data manipulation and analysis with tabular data structures.
3. SciPy: For scientific computing and optimization algorithms.
4. scikit-learn: For machine learning algorithms and model training.
5. NetworkX: For graph algorithms and network analysis.
6. itertools: For efficient iteration and combination of elements.
7. heapq: For heap operations and priority queue implementations.
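
For example, heapq can serve as a simple priority queue; the task names below are invented for illustration.

    import heapq

    tasks = []                                 # min-heap: lowest priority value first
    heapq.heappush(tasks, (2, "clean data"))
    heapq.heappush(tasks, (1, "ingest batch"))
    heapq.heappush(tasks, (3, "train model"))

    while tasks:
        priority, name = heapq.heappop(tasks)  # pops in order 1, 2, 3
        print(priority, name)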
2.7 Research Challenges
Big data consists of huge, very complex data sets. The data generated is highly dynamic, which further adds to its complexity. The raw data must be processed in order to extract value from it. This gives rise to challenges in processing big data and to the business issues associated with it. The volume of data generated worldwide is growing exponentially. Almost all industries, such as healthcare, automotive, financial, and transportation, rely on this data to improve their business and strategies. For example, airlines perform millions of transactions per day and have established data warehouses to store data, taking advantage of machine learning techniques to gain insights that inform business strategy. The public administration sector also uses patterns in data generated across different age groups of the population to increase productivity. Many scientific fields, too, have become data-driven and probe into the knowledge discovered from these data.
Cloud computing has become a standard solution for handling and processing big data. Despite all the advantages of integrating big data and cloud computing, there are several challenges in data transmission, data storage, data transformation, data quality, privacy, and governance [14, 15].
Data Transmission
Data sets are growing exponentially. Along with the size, the frequency at which real-time data are transmitted over communication networks has also increased. Healthcare professionals exchange health information such as high-definition medical images transmitted electronically, while some scientific applications may have to transmit terabytes of data files that take a long time to traverse the network. In the case of streaming applications, preserving the correct sequence of data packets is as critical as the transmission speed. Cloud data stores are used for data storage; however, network bandwidth, latency, throughput, and security pose challenges.
Data Acquisition and Storage
Data acquisition is the process of collecting data from disparate sources and filtering and cleansing it before it can be stored in a data warehouse or other storage system. While acquiring big data, the main characteristics that pose a challenge are its sheer volume, velocity, and variety. This demands more adaptable gathering, filtering, and cleaning algorithms that ensure data are acquired in a more time-efficient manner.
Once acquired, data needs to be stored in high-capacity data stores that provide reliable access. Current storage technologies include Direct Attached Storage (DAS) and Network Attached Storage (NAS).
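
A minimal sketch of such a filtering and cleaning step is shown below, written with Python generators so that records stream through one at a time instead of being held in memory all at once; the record fields are hypothetical.

    def clean(records):
        """Drop malformed records and normalize fields as they stream through."""
        for rec in records:
            if rec.get("id") is None:          # filter: discard unusable rows
                continue
            rec["name"] = rec.get("name", "").strip().lower()  # cleanse
            yield rec

    raw = [{"id": 1, "name": "  Ada "}, {"name": "no id"}, {"id": 2, "name": "Grace"}]
    for rec in clean(raw):
        print(rec)  # {'id': 1, 'name': 'ada'} then {'id': 2, 'name': 'grace'}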
Data Curation
Data curation refers to the active and ongoing management of data through its entire lifecycle, from creation or ingestion until it is archived or becomes obsolete and is deleted. During this process, data passes through various phases of transformation to ensure that it is securely stored and retrievable. Organizations must invest in the right people and provide them with the right tools to curate data. Such an investment in data curation will yield a larger quantity of high-quality data.
Scalability
Scalability refers to the ability to provide resources to meet business needs in an appropriate way. It is a planned level of capacity that can grow as needed, and it is mainly manual and static. Most big data systems, however, must be elastic to handle changes in data load. At the platform level there is vertical and horizontal scalability. As the number of cloud users and the volume of data increase rapidly, it becomes a challenge to scale the cloud's capacity to provide storage and processing to the many individuals connected to the cloud at the same time.
Elasticity
Elasticity refers to the cloud's ability to reduce operational cost while ensuring optimal performance regardless of computational workload. Elasticity accommodates data load variations using replication, migration, and resizing techniques, all in real time and without service disruption. Most of these operations are still manual rather than automated.
Availability
Availability refers to the on-demand availability of systems to authorized users. A key obligation of cloud providers is to allow users to access one or more data services in a short time. As business models evolve, demand for real-time system availability will continue to rise.
Data integrity
Data integrity refers to data being modified only by authorized users, in order to prevent misuse. Cloud-based applications allow users to store and manage their data in cloud data centres; however, these applications must maintain data integrity. Since users may not be able to physically access the data, the cloud should provide mechanisms to check the integrity of the data.
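
One common integrity mechanism is to compare cryptographic checksums of the data before upload and after retrieval. The sketch below shows the idea with the standard library; the file name is hypothetical, and real cloud stores typically expose their own checksum or ETag facilities.

    import hashlib

    def sha256_of_file(path, chunk_size=1 << 20):
        """Stream the file in chunks so large objects need not fit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    before = sha256_of_file("dataset.csv")  # checksum recorded before upload
    # ... upload to the cloud, later download a copy, then recompute:
    after = sha256_of_file("dataset.csv")   # checksum of the retrieved copy
    assert before == after, "data was modified or corrupted in transit"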
Security and Privacy
Maintaining the security of data stored in the cloud is very important. Sensitive and personal information kept in the cloud should be designated for internal use only and not shared with third parties. This is a major concern when providing personalized and location-based services, as access to personal information is required to produce relevant results. Every operation, such as transmitting data over the network, interconnecting systems, or mapping virtual machines to their respective physical machines, must be done in a secure way.
Heterogeneity
Big data is vast and diverse. Cloud computing systems need to deal with data in different formats: structured, semi-structured, and unstructured, coming from various sources. Documents, photos, audio, videos, and other unstructured data can be difficult to search and analyse. Combining all the unstructured data and reconciling it so that it can be used to create reports can be incredibly difficult in real time.
Data Governance and Compliance
Data governance specifies the exercise of control and authority over the way data is handled and the accountability of individuals in achieving business objectives. Data policies must define the formats in which data is stored and the constraint models that limit access to the underlying data. Defining stable data policies in the face of increasing data size and demand for faster and better data management technology is not an easy task, and poorly chosen policies can prove counterproductive.
Data Uploading
Data uploading refers to the ease with which massive data sets can be moved to the cloud. Data is usually uploaded over the internet, and the speed at which it is uploaded depends in turn on network bandwidth and security. This calls for improved, efficient data-uploading algorithms that minimize upload times and provide a secure way to transfer data onto the cloud.
Data Recovery
Data recovery refers to the procedures and techniques by which data can be restored to its original state in scenarios such as data loss due to corruption or a virus attack. Since periodic backups of petabytes of data are time-consuming and costly, it is necessary to identify the subset of data most valuable to the organization for backup. If this subset of data is lost or corrupted, it can take weeks to rebuild at such scales, resulting in more downtime for users.
Data Visualization
Data visualization is a quick and easy way to represent complex things graphically for better intuition and understanding. It needs to reveal the various patterns and correlations hidden within massive data. Structured data can be represented in traditional graphical ways, whereas it is difficult to visualize highly diverse, uncertain, semi-structured and unstructured big data in real time. Coping with such large, dynamic data requires immense parallelization, which is itself a challenge in visualization.
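
For the structured case, a traditional chart is straightforward to produce; the following minimal Matplotlib sketch uses invented figures purely for illustration.

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    volume_tb = [1.2, 1.8, 2.9, 4.1]  # hypothetical data volume per month

    plt.bar(months, volume_tb)        # structured data charts easily
    plt.xlabel("Month")
    plt.ylabel("Data volume (TB)")
    plt.title("Monthly data volume (illustrative)")
    plt.show()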
2.8 Conclusion
In the big data era, innovation and competition driven by advancements in cloud computing have made it possible to discover hidden knowledge in data. In this report we have given an overview of big data applications in cloud computing, their challenges in storing, transforming, and processing data, and some good design principles that could lead to further research.
2.9 References
1. Konstantinou, I., Angelou, E., Boumpouka, C., Tsoumakos, D., & Koziris, N. (2011, October). On the elasticity of NoSQL databases over cloud management platforms. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 2385-2388). ACM.
2. Labrinidis, A., & Jagadish, H. V. (2012). Challenges and opportunities with big data. Proceedings of the VLDB Endowment, 5(12), 2032-2033.
3. Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Eng. Bull., 32(1), 3-12.
4. Luhn, H. P. (1958). A business intelligence system. IBM Journal of Research and Development, 2(4), 314-319.
5. Sivarajah, U., et al. (2017). Critical analysis of Big Data challenges and analytical methods. Journal of Business Research, 70, 263-286.
6. https://www.bmc.com/blogs/saas-vs-paas-vs-iaas-whats-the-difference-and-how-to-choose/
7. Kavis, M. J. (2014). Architecting the Cloud: Design Decisions for Cloud Computing Service Models (SaaS, PaaS, and IaaS). John Wiley & Sons.
8. https://www.ripublication.com/ijaer17/ijaerv12n17_89.pdf
9. Sakr, S., & Gaber, M. M. (2014). Large Scale and Big Data: Processing and Management. Auerbach Publications.
10. Ji, C., et al. (2012). Big data processing in cloud computing environments. In 2012 12th International Symposium on Pervasive Systems, Algorithms and Networks. IEEE.
11. Han, J., Haihong, E., Le, G., & Du, J. (2011, October). Survey on NoSQL database. In 2011 6th International Conference on Pervasive Computing and Applications (ICPCA) (pp. 363-366). IEEE.
12. Zhang, L., et al. (2013). Moving big data to the cloud. In INFOCOM, 2013 Proceedings IEEE (pp. 405-409).
13. Fernández, A., et al. (2014). Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 380-409.
14. http://acme.able.cs.cmu.edu/pubs/uploads/pdf/IoTBD_2016_10.pdf
15. Xiaofeng, M., & Xiang, C. (2013). Big data management: concepts, techniques and challenges. Journal of Computer Research and Development, 1(98), 146-169.

Internship Report].pdf iiwmoosmsosmshkssmk

  • 1.
    1 Internship Report On “Big Dataand Cloud Computing” Submitted By Aditya D. Shinde T190304350 Under the guidance of Prof. Pradnya Kothawade DEPARTMENT OF COMPUTER ENGINEERING GENBA SOPANRAO MOZE COLLEGE OF ENGINEERING, BALEWADI, PUNE-411041 SAVITRIBAI PHULE PUNE UNIVERSITY [Year:2023-2024]
  • 2.
    2 CERTIFICATE This is tocertify that the “Internship Report” submitted by ADITYA D. SHINDE, is work done by them and submitted during 2023-2024 academic year, Big Data and Cloud Computing in degree of BACHELOR OF ENGINEERING IN COMPUTER ENGINEERING ,at YBI Foundation. (Prof. Pradnya Kothawade ) (Prof. Rahul Kumar) Guide Head, Department ofComputer Engineering Department ofComputer Engineering (Dr.RatnarajKumar Jambi) Principal, Genba Sopanrao Moze College ofEngineering Pune – 45 Place : Pune Date : 30/04/2024
  • 3.
    3 ACKNOWLEDGEMENT With immense pleasure,we present the Internship Report as part of the curriculum of the T.E. Computer Engineering. We wish to thank all the people who gave us an unending support from beginning of the Internship . We express our sincere and profound thanks to Prof. Pradnya Kothawade our Internship Guide and our HOD Prof. Rahul Kumar who always stood by us as the helping and guiding support and all those who have directly or indirectly guided and helped us in the preparation of the Internship . . Aditya D. Shinde
  • 4.
    4 TABLE OF CONTENTS SNo. Content Page No. 1.3 Abstract 5 1.4 Internship Place Details 6 1.5 Certificate 7 2.1 Introduction To Internship 8 2.2 Mode of internship 9 2.3 Domain of Internship 11 2.4 Objectives of Internship 15 2.5 Motivation/Scope of Internship 16 2.6 Methodological Details 18 2.7 Research Challenges 20 2.8 Conclusion 23 2.9 References 24
  • 5.
    5 Abstract Big Data isused in decision making process to gain useful insights hidden in the data for business and engineering. At the same time it presents challenges in processing, cloud computing has helped in advancement of big data by providing computational, networking and storage capacity. This paper presents the review, opportunities and challenges of transforming big data using cloud computing resources.
  • 6.
    6 INTERNSHIP PLACE DETAILS Companybackground-organization and activities Name of Company YBI Foundation Contact number of Company (+91) 9667987711 Name of Director Alok Yadav Email ID of Supervisor alokyadav@yantrabyte.com Company Background : YBI Foundation is a non-profit organization dedicated to empowering youth through education, entrepreneurship, and community development initiatives. Founded in [Year], our mission is to foster a generation of young leaders equipped with the skills, resources, and support networks necessary to thrive in an ever-changing world. Our vision is to create a world where every young person has access to quality education, economic opportunities, and a supportive community that nurtures their growth and potential. Looking ahead, YBI Foundation remains committed to expanding our reach and deepening our impact. Our future directions include: • Scaling our education and entrepreneurship programs to reach more youth in underserved communities. • Strengthening partnerships with government agencies, corporations, and civil society organizations to leverage resources and expertise for greater collective impact. • Embracing technology and digital innovation to enhance the delivery and effectiveness of our programs, particularly in response to the evolving needs of youth in a rapidly changing world.
  • 7.
  • 8.
    8 2.1 Introduction ToInternship The volume and information captured from various mobile devices and multimedia by organizations is increasing every moment and has almost doubled every year. This sheer volume of data generated can be categorized as structured or unstructured data that cannot be easily loaded into regular relational databases. This big data requires pre-processing to convert the raw data into clean data set and made feasible for analysis. Healthcare, finance, engineering, e commerce and various scientific fields use these data for analysis and decision making. The advancement in data science, data storage and cloud computing has allowed for storage and mining of big data. Cloud computing has resulted in increased parallel processing, scalability, accessibility, data security, virtualization of resources and integration with data storages. Cloud computing has eliminated the infrastructure cost required to invest in hardware, facilities, utilities or building large data centres. Cloud infrastructure scales on demand to support fluctuating workloads which has resulted in the scalability of data produced and consumed by the big data applications. Cloud virtualization can create virtual platform of server operating system and storage devices to spawn multiple machines at the same time. This provides a process to share resources and isolation of hardware to increase the access, management, analysis and computation of the data. The main objective of this paper is to provide review, opportunities and challenges of big data applications in cloud computing which requires data to be processed efficiently and also provide some good design principles.
  • 9.
    9 2.2 Mode ofInternship Online Internship: Navigating the Virtual Workspace In recent years, the rise of technology and connectivity has transformed the landscape of internship opportunities, paving the way for online internships that transcend geographical boundaries and offer flexibility in engagement. An online internship, also known as a virtual internship, allows interns to work remotely from any location with internet access, leveraging digital tools and communication platforms to collaborate with colleagues, complete projects, and gain valuable hands-on experience in their chosen field. This section explores the intricacies of online internships, including their benefits, challenges, and best practices for success. Benefits of Online Internships: • Flexibility: Online internships offer flexibility in terms of location and schedule, allowing interns to work from the comfort of their homes or any other preferred workspace. This flexibility enables interns to balance their internship commitments with academic studies, personal responsibilities, or other part-time work. • Access to Opportunities: Online internships provide access to a wide range of opportunities across diverse industries and geographical regions, eliminating geographical barriers and opening doors to internships with organizations located anywhere in the world. This expanded access increases the likelihood of finding internships aligned with individual interests, skills, and career goals. • Cost-Effectiveness: By eliminating the need for commuting or relocating to a different city for the duration of the internship, online internships can be more cost-effective for interns, reducing expenses associated with transportation, accommodation, and daily living expenses. • Enhanced Digital Skills: Engaging in an online internship exposes interns to a variety of digital tools, collaboration platforms, and communication technologies commonly used in remote work environments. This hands-on experience enhances interns' digital literacy, adaptability, and proficiency in navigating virtual workspaces, which are increasingly valued in today's digital economy. • Global Networking Opportunities: Online internships enable interns to connect and collaborate with professionals, mentors, and peers from diverse cultural backgrounds and geographic locations. Building relationships with professionals across the globe expands interns' professional networks, fosters cross-cultural understanding, and opens doors to future career opportunities.
  • 10.
    10 Challenges of OnlineInternships: • Communication Barriers: In a virtual work environment, communication may be hindered by factors such as time zone differences, language barriers, and reliance on digital communication channels. Clear and effective communication becomes paramount to ensure alignment, collaboration, and understanding among team members. • Lack of In-Person Interaction: Unlike traditional in-person internships, online internships lack face-to-face interaction, which may hinder relationship-building, mentorship, and informal learning opportunities that often occur organically in a physical workplace. Overcoming this challenge requires proactive efforts to foster virtual connections, build rapport, and seek mentorship through digital channels. • Technical Issues: Technical glitches, internet connectivity issues, and software compatibility problems are common challenges faced in online internships, which can disrupt workflow, delay project timelines, and hinder productivity. Interns must be prepared to troubleshoot technical issues independently or seek timely assistance from technical support teams. • Self-Discipline and Time Management: Working remotely requires a high degree of self- discipline, organization, and time management skills to stay focused, meet deadlines, and balance competing priorities effectively. Interns must proactively manage their time, set realistic goals, and establish routines to maintain productivity and accountability in a virtual work environment. Best Practices for Success in Online Internships: • Establish Clear Expectations: Clarify expectations, goals, and deliverables with supervisors at the outset of the internship to ensure alignment and understanding of roles and responsibilities. • Communicate Proactively: Maintain regular communication with supervisors, colleagues, and mentors through email, instant messaging, video conferencing, or project management platforms to stay informed, seek clarification, and provide updates on progress. • Embrace Digital Collaboration Tools: Familiarize yourself with digital collaboration tools such as Slack, Microsoft Teams, Zoom, Trello, or Asana to facilitate communication, project management, and collaboration with remote teams. • Seek Feedback and Mentorship: Actively seek feedback from supervisors and mentors on your work, progress, and areas for improvement. Establish regular check-ins or virtual meetings to discuss goals, challenges, and opportunities for growth. • Stay Organized and Productive: Create a dedicated workspace, establish a daily routine, and prioritize tasks to maintain focus, productivity, and work-life balance while working remotely. • Network and Build Relationships: Take advantage of virtual networking opportunities, industry webinars, and online communities to expand your professional network, engage with peers, and build meaningful relationships within your field of interest.
  • 11.
    11 2.3 Domain ofInternship Domain of Internship: Big Data And Cloud Computing BIG DATA Data which is huge, difficult to store, manage and analyze through traditional databases is termed as “Big Data”. It requires a scalable architecture for their efficient storage, manipulation, and analysis. Such massive volume of data comes from myriad sources: smartphones and social media posts; sensors, such as traffic signals and utility meters; point-of-sale terminals; consumer wearables such as fit meters and electronic health records. Various technologies are integrated to discover hidden values from these varied, complex data and transform it into actionable insight, improved decision making, and competitive advantage. The characteristics of big data are: 1.) Volume – Refers to incredible amount of data generated each second from different sources such as social media, cell phones, cars, credit cards, M2M sensors, photographs and videos which would allow users to data mine the hidden information and patterns found in them. 2.) Velocity - Refers to the speed at which data is being generated, transferred, collected and analyzed. Data generated at an ever-accelerating pace must be analyzed and the speed of transmission, and access to the data must remain instantaneous to allow for real-time access to different applications that are dependent on these data. 3.) Variety – Refers to data generated in different formats either in structured or unstructured format. Structured data such as name, phone number, address, financials, etc. can be organized within the columns of a database. This type of data is relatively easy to enter, store, query, and analyze. Unstructured data which contributes to 80% of today’s world data are more difficult to sort and extract value. Unstructured data include text messages, audio, blogs, photos, video sequences, social media updates, log files, machine and sensor data. 4.) Value – Refers to the hidden value discovered from the data for decision making. Substantial value can be found in big data, including understanding your customers better, targeting them accordingly, optimizing processes, and improving machine or business performance. 5.) Veracity - Refers to the quality and reliability of the data source. Its importance is in the context and the meaning it adds to the analysis. Knowledge of the data's veracity in turn helps in better understanding the risks associated with analysis and business decisions based on data set.
  • 12.
    12 V’s of BigData CLOUD COMPUTING Cloud computing delivers computing services such as servers, storage, databases, networking, software, analytics and intelligence over the internet for faster innovation, flexible resources, heavy computation, parallel data processing and economies of scale. It empowers the organizations to concentrate on core business by completely abstracting computation, storage and network resources to workloads as needed and tap into an abundance of prebuilt services. Figure 3 shows the differences between on-premise and cloud services. It shows the services offered by each computing layer and differences between them.
  • 13.
    13 User Management CloudManagement The array of available cloud computing services is vast, but most fall into one of the following categories: SaaS: Software as a Service Software as a service represents the largest cloud market and most commonly used business option in cloud services. SaaS delivers applications to the users over the internat. Applications that are delivered through SaaS are maintained by third-party vendors and interfaces are accessed by the client through the browser. Since most of SaaS applications run directly from a browser, it eliminates the need for the client to download or install any software. In SaaS vendor manages applications, runtime, data, middleware, OS, virtualization, servers, storage and networking which makes it easy for enterprises to streamline their maintenance and support. PaaS: Platform as a Service Platform as a Service model provides hardware and software tools over the internet which are used by developers to build customized applications. PaaS makes the development, testing and deployment of applications quick, simple and cost-effective. This model allows business to design and create applications that are integrated into PaaS software components while the enterprise operations or thirty-party providers manage OS, virtualization, servers, storages, networking and the PaaS software itself. These applications are scalable and highly available since they have cloud characteristics. IaaS: Infrastructure as a Service Infrastructure as a Service cloud computing model provides self-servicing platform for accessing, monitoring and managing remote data center infrastructures such as compute,
  • 14.
    14 storage and networkingservices to organizations through virtualization technology. IaaS users are responsible for managing applications, data, runtime, middleware, and OS while providers still manage virtualization, servers, hard drives, storage, and networking. IaaS provides same capabilities as data centers without having to maintain them physically [6]. Figure 4 represents the different cloud computing services been offered. Primary Cloud Computing Services
  • 15.
    15 2.4 Objectives ofInternship Objectives of Data Science and Data Analytics Internship Embarking on a data science and data analytics internship presents a unique opportunity for individuals to gain practical experience, develop technical skills, and apply theoretical knowledge to real-world data challenges. The objectives of such an internship are multifaceted, encompassing professional development, hands-on learning, and contributions to organizational goals. This section delineates the key objectives of a data science and data analytics internship, elucidating their significance and impact on the intern's growth and learning journey. • Hands-on Experience: The primary objective of a data science and data analytics internship is to provide interns with hands-on experience in working with real-world data sets, tools, and methodologies. By engaging in data collection, cleaning, analysis, and visualization tasks, interns gain practical exposure to the entire data lifecycle, honing their technical skills and problem-solving abilities in a professional setting. • Application of Theoretical Knowledge: Internships offer a platform for interns to apply theoretical concepts and techniques learned in academic courses to practical data challenges. By working on projects that require data manipulation, statistical analysis, and machine learning model development, interns reinforce their understanding of core data science and data analytics principles and gain insights into their real-world applications. • Skill Development: Internships serve as a catalyst for skill development, enabling interns to enhance their proficiency in programming languages such as Python, R, or SQL, as well as data analysis tools and libraries such as Pandas, NumPy, and TensorFlow. Through hands-on projects and mentorship from experienced professionals, interns sharpen their data manipulation, statistical analysis, and machine learning skills, preparing them for future roles in the field. • Exposure to Industry Practices: Internships provide interns with exposure to industry best practices, tools, and workflows commonly used in data science and data analytics roles. By working alongside seasoned professionals and collaborating on real-world projects, interns gain insights into how data-driven decisions are made within organizations, learning about data governance, project management, and ethical considerations in data science. • Contribution to Organizational Goals: Interns play a valuable role in contributing to organizational goals through their internship projects. Whether it's building predictive models to improve business forecasting, analyzing customer behavior to optimize marketing strategies, or automating data pipelines for enhanced efficiency, interns have the opportunity to make tangible contributions that drive business outcomes and add value to the organization. • Networking and Professional Development: Internships provide interns with networking opportunities and exposure to industry professionals, mentors, and peers. By building relationships, seeking mentorship, and participating in team collaborations, interns expand their professional network, gain valuable insights into career pathways, and lay the foundation for future opportunities in the field.
2.5 Motivation/Scope of Internship

Undertaking a data science and data analytics internship is driven by various motivations that align with career aspirations, academic interests, and personal growth objectives. The scope of such an internship encompasses a range of learning opportunities, project engagements, and contributions to organizational goals. This section explores the motivation and scope of a data science and data analytics internship, shedding light on the rationale, objectives, and potential impact of the internship experience.

Motivation:
• Professional Growth: A primary motivation for pursuing a data science and data analytics internship is to foster professional growth and development. Interns seek to gain hands-on experience in working with real-world data sets, applying analytical techniques, and leveraging data-driven insights to solve business problems. This experience enhances their skills, expands their knowledge base, and prepares them for future roles in the field.
• Skill Enhancement: Internships offer opportunities for interns to enhance their technical skills in areas such as programming languages (e.g., Python, R, SQL), statistical analysis, machine learning algorithms, data visualization, and big data technologies. By engaging in data manipulation, modeling, and interpretation tasks, interns strengthen their analytical capabilities and proficiency in utilizing data science tools and techniques.
• Exploration of Career Pathways: Internships serve as a platform for interns to explore different career pathways within the field of data science and data analytics. Whether focusing on data engineering, machine learning, business intelligence, or data visualization, interns gain exposure to diverse roles, industries, and applications of data science, helping them identify areas of interest and specialization.
• Networking and Mentorship: Internships provide avenues for interns to network with professionals in the data science community and seek mentorship from experienced practitioners. By connecting with industry professionals, attending networking events, and participating in mentorship programs, interns can gain valuable insights, advice, and guidance that inform their career decisions and trajectory in the field.
• Contribution to Meaningful Projects: Interns are motivated by the opportunity to contribute to meaningful projects that have real-world impact and add value to the organization. Whether analyzing customer data to improve marketing strategies, building predictive models to optimize business processes, or conducting exploratory analysis to uncover insights, interns seek to make tangible contributions that drive business outcomes and innovation.
Scope:
• Learning Objectives: The scope of a data science and data analytics internship is defined by its learning objectives, which outline the specific knowledge, skills, and competencies that interns aim to acquire or develop during the internship period. These objectives may include mastering data manipulation techniques, understanding machine learning algorithms, or gaining proficiency in data visualization tools.
• Project Engagement: Internship projects within the scope of data science and data analytics vary in complexity and focus, ranging from exploratory data analysis to predictive modeling and data-driven decision-making. The scope of the internship project defines the tasks, deliverables, and outcomes that interns are expected to achieve, guiding their activities and responsibilities throughout the internship.
• Industry Focus: The scope of the internship may also be influenced by the industry or sector in which the internship is situated. Whether it is technology, finance, healthcare, e-commerce, or government, the industry focus determines the context, challenges, and opportunities encountered during the internship journey, shaping the scope of projects and the applications of data science techniques.
• Data Sets and Tools: The internship scope includes working with diverse data sets, ranging from structured databases to unstructured text and image data. Interns engage with a variety of data science tools and technologies, including programming languages (e.g., Python, R), statistical packages (e.g., Pandas, NumPy), machine learning libraries (e.g., scikit-learn, TensorFlow), and data visualization tools (e.g., Matplotlib, Tableau), depending on the scope of their projects and organizational requirements.
• Collaboration and Communication: Interns collaborate with cross-functional teams, stakeholders, and mentors to accomplish project goals and deliverables within the scope of the internship. Effective communication, teamwork, and project management skills are essential for navigating project scope, addressing challenges, and ensuring alignment with organizational objectives.
2.6 Methodological Details

Methodological Details on Python with Data Structures and Algorithms

Python, as a versatile programming language, is widely used for implementing data structures and algorithms due to its simplicity, readability, and extensive library support. This section provides methodological details on how Python is utilized for implementing data structures and algorithms, including the choice of data structures, algorithm design principles, and practical considerations.

1. Choice of Data Structures: Python offers built-in and library-supported data structures that are essential for implementing algorithms efficiently. Commonly used data structures include:
• Lists: Dynamic arrays that can store elements of different data types.
• Tuples: Immutable sequences typically used for heterogeneous data.
• Sets: Unordered collections of unique elements for fast membership testing.
• Dictionaries: Key-value pairs for efficient data retrieval based on keys.
• Arrays: Homogeneous collections of elements with fixed size for numerical computations.
• Linked Lists: Linear data structures composed of nodes, each containing a data element and a reference to the next node.
• Stacks and Queues: Abstract data types for Last-In-First-Out (LIFO) and First-In-First-Out (FIFO) operations, respectively.
• Trees: Hierarchical data structures consisting of nodes connected by edges, with examples including binary trees, binary search trees, and heaps.
• Graphs: Non-linear data structures composed of nodes and edges, representing relationships between entities.
A short illustrative sketch of several of these structures appears at the end of this section.

2. Algorithm Design Principles: When implementing algorithms in Python, adhering to established design principles enhances code readability, performance, and maintainability. Key algorithm design principles include:
• Divide and Conquer: Break down complex problems into smaller, more manageable subproblems, solve them recursively, and combine their solutions.
• Dynamic Programming: Store solutions to overlapping subproblems in a table to avoid redundant computations in recursive algorithms.
• Greedy Algorithms: Make locally optimal choices at each step with the hope of finding a globally optimal solution.
• Backtracking: Systematically explore all possible solutions to a problem by recursively building candidates and rejecting those that fail to satisfy constraints.
• Sorting and Searching: Utilize efficient sorting algorithms (e.g., Merge Sort, Quick Sort) and searching techniques (e.g., Binary Search) for data manipulation and retrieval tasks; a worked sketch of several of these principles follows at the end of this section.

3. Practical Considerations: In addition to selecting appropriate data structures and algorithms, practical considerations ensure efficient implementation and optimization of Python code:
• Time and Space Complexity Analysis: Analyze the time and space complexity of algorithms to assess their efficiency and scalability, guiding the selection of optimal solutions for specific use cases.
• Code Optimization: Optimize Python code for performance by minimizing redundant computations, reducing memory overhead, and leveraging built-in functions and library modules.
• Unit Testing: Validate the correctness and functionality of Python implementations through unit testing, ensuring that algorithms produce expected outputs for various inputs and edge cases (see the small test sketch at the end of this section).
• Error Handling: Implement robust error handling mechanisms to handle exceptions, edge cases, and unexpected behaviors gracefully, enhancing the reliability and robustness of Python code.

4. Libraries and Frameworks: Python offers a rich ecosystem of libraries and frameworks for data structures and algorithms, including:
1. NumPy: For numerical computations and array manipulation.
2. Pandas: For data manipulation and analysis with tabular data structures.
3. SciPy: For scientific computing and optimization algorithms.
4. scikit-learn: For machine learning algorithms and model training.
5. NetworkX: For graph algorithms and network analysis.
6. itertools: For efficient iteration and combination of elements.
7. heapq: For heap operations and priority queue implementations.
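As a minimal illustration of several of the structures listed above, the following sketch uses only standard Python (the example values are invented for illustration):

    from collections import deque
    import heapq

    # List: dynamic array that can hold elements of different types
    marks = [78, 92, 65]
    marks.append(88)              # amortized O(1) append

    # Tuple: immutable sequence, often used for fixed records
    point = (10, 20)

    # Set: unordered unique elements with O(1) average membership test
    unique_marks = set(marks)
    print(92 in unique_marks)     # True

    # Dictionary: key-value mapping with O(1) average lookup
    student = {"roll_no": 17, "name": "Asha"}

    # Stack (LIFO) using a plain list
    stack = []
    stack.append("task1")
    top = stack.pop()

    # Queue (FIFO) using collections.deque, O(1) at both ends
    queue = deque()
    queue.append("job1")
    first = queue.popleft()

    # Priority queue (min-heap) using heapq
    heap = []
    heapq.heappush(heap, (2, "low priority"))
    heapq.heappush(heap, (1, "high priority"))
    print(heapq.heappop(heap))    # (1, 'high priority')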
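The next sketch works through three of the design principles named above: divide and conquer (Merge Sort), searching (Binary Search), and dynamic programming via memoization. It is an illustrative sketch, not a prescribed implementation:

    from functools import lru_cache

    def merge_sort(arr):
        """Divide and conquer: split the list, sort each half recursively, merge."""
        if len(arr) <= 1:
            return arr
        mid = len(arr) // 2
        left, right = merge_sort(arr[:mid]), merge_sort(arr[mid:])
        merged = []
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged

    def binary_search(items, target):
        """Search a sorted list in O(log n); return index of target, or -1."""
        lo, hi = 0, len(items) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if items[mid] == target:
                return mid
            if items[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1

    @lru_cache(maxsize=None)
    def fib(n):
        """Dynamic programming via memoization: each subproblem is solved once."""
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    data = merge_sort([5, 3, 8, 1, 9, 2])
    print(data)                    # [1, 2, 3, 5, 8, 9]
    print(binary_search(data, 8))  # 4
    print(fib(30))                 # 832040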
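Finally, a small test sketch for the Unit Testing consideration above, assuming the binary_search function from the previous sketch is defined in the same module; it checks expected outputs and edge cases using the standard unittest library:

    import unittest

    class TestBinarySearch(unittest.TestCase):
        def test_found(self):
            self.assertEqual(binary_search([1, 2, 3, 5, 8, 9], 8), 4)

        def test_missing(self):
            self.assertEqual(binary_search([1, 2, 3], 7), -1)

        def test_empty(self):
            # Edge case: an empty input should report "not found"
            self.assertEqual(binary_search([], 1), -1)

    if __name__ == "__main__":
        unittest.main()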
2.7 Research Challenges

Big data refers to huge data sets that are very complex, and the data generated is highly dynamic, which adds further to this complexity. Raw data must be processed in order to extract value from it, which gives rise to challenges in processing big data and to the business issues associated with it. The volume of data generated worldwide is growing exponentially, and almost all industries, such as healthcare, automobile, financial, and transportation, rely on this data to improve their business and strategies. For example, airlines perform millions of transactions per day and have established data warehouses so that machine learning techniques can extract insights from the data that inform business strategy. The public administration sector also uses information patterns from data generated by different age groups of the population to increase productivity. Many scientific fields, too, have become data driven and probe into the knowledge discovered from these data. Cloud computing has become a standard solution for handling and processing big data. Despite all the advantages of integration between big data and cloud computing, there are several challenges in data transmission, data storage, data transformation, data quality, privacy, and governance [14, 15].

Data Transmission
Data sets are growing exponentially, and along with the size, the frequency at which real-time data are transmitted over communication networks has also increased. Healthcare professionals exchange health information such as high-definition medical images electronically, while some scientific applications may have to transmit terabytes of data files that take long to traverse the network. In streaming applications, the correct sequence of the data packets is as critical as the transmission speed. Cloud data stores are used for data storage; however, network bandwidth, latency, throughput, and security pose challenges.

Data Acquisition and Storage
Data acquisition is the process of collecting data from disparate sources and filtering and cleansing it before it can be stored in a data warehouse or other storage system. While acquiring big data, the main characteristics that pose a challenge are the sheer volume, greater velocity, and variety of the data. This demands more adaptable gathering, filtering, and cleaning algorithms that acquire data in a time-efficient manner (a small illustrative cleansing sketch appears at the end of this section). Once acquired, data needs to be stored in high-capacity data stores that provide reliable access. Current storage technologies include Direct Attached Storage (DAS) and Network Attached Storage (NAS).

Data Curation
It refers to the active and ongoing management of data through its entire lifecycle, from creation
or ingestion to when it is archived or becomes obsolete and is deleted. During this process, data passes through various phases of transformation to ensure that it is securely stored and retrievable. Organizations must invest in the right people and provide them with the right tools to curate data; such an investment in data curation leads to more high-quality data.

Scalability
Scalability refers to the ability to provide resources to meet business needs in an appropriate way. It is a planned level of capacity that can grow as needed, and it is mainly manual and static. Most big data systems must be elastic to handle data changes; at the platform level there is vertical and horizontal scalability. As the number of cloud users and the volume of data increase rapidly, it becomes a challenge to scale the cloud's capacity to provide storage and processing to the many individuals connected to the cloud at the same time.

Elasticity
It refers to the cloud's ability to reduce operational cost while ensuring optimal performance regardless of computational workload. Elasticity accommodates data load variations using replication, migration, and resizing techniques, all in real time and without service disruption. Most of these operations are still manual rather than automated.

Availability
Availability refers to the on-demand availability of systems to authorized users. A key obligation of cloud providers is to allow users to access one or more data services in a short time, and as business models evolve, the demand for real-time system availability will only rise.

Data Integrity
Data integrity refers to modification of data only by authorized users, in order to prevent misuse. Cloud-based applications allow users to store and manage their data in cloud data centres, but these applications must maintain data integrity. Since users may not be able to physically access the data, the cloud should provide mechanisms to check the integrity of the data (a minimal checksum-based sketch appears at the end of this section).

Security and Privacy
Maintaining the security of data stored in the cloud is very important. Sensitive and personal information kept in the cloud should be designated for internal use only and not shared with third parties. This is a major concern when providing personalized and location-based services, since access to personal information is required to produce relevant results. Every operation, such as transmitting data over the network, interconnecting systems, or mapping virtual machines to their respective physical machines, must be performed in a secure way.

Heterogeneity
Big data is vast and diverse. Cloud computing systems need to deal with different formats: structured, semi-structured, and unstructured data coming from various sources. Documents, photos, audio, video, and other unstructured data can be difficult to search and analyse, and combining all the unstructured data and reconciling it so that it can be used to create reports can be
incredibly difficult in real time.

Data Governance and Compliance
Data governance specifies the exercise of control and authority over the way data is handled and the accountabilities of individuals when achieving business objectives. Data policies must define the formats in which data is stored and the constraint models that limit access to the underlying data. Defining stable data policies in the face of increasing data size and demand for faster, better data management technology is not easy, and poorly chosen policies can be counterproductive.

Data Uploading
It refers to the ease with which massive data sets can be uploaded to the cloud. Data is usually uploaded over the internet, and the upload speed depends on network bandwidth and security. This calls for improved, efficient data uploading algorithms that minimize upload times and provide a secure way to transfer data onto the cloud.

Data Recovery
It refers to the procedures and techniques by which data can be reverted to its original state in scenarios such as data loss due to corruption or a virus attack. Since periodic backups of petabytes of data are time consuming and costly, it is necessary to identify the subset of data most valuable to the organization for backup. If this subset is lost or corrupted, it can take weeks to rebuild at these huge scales, resulting in more downtime for users.

Data Visualization
Data visualization is a quick and easy way to represent complex things graphically for better intuition and understanding. It needs to reveal the various patterns and correlations hidden in massive data. Structured data can be represented in traditional graphical ways, whereas it is difficult to visualize highly diverse, uncertain, semi-structured and unstructured big data in real time. Coping with such large, dynamic data requires immense parallelization, which is itself a challenge in visualization.
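As a minimal sketch of the filtering and cleansing step described under Data Acquisition and Storage above, the following uses only standard Pandas calls; the file name and column names are invented for illustration:

    import pandas as pd

    # Hypothetical raw feed; "sensor_readings.csv" and its columns are
    # invented for this example.
    raw = pd.read_csv("sensor_readings.csv")

    # Drop exact duplicate records collected from overlapping sources
    clean = raw.drop_duplicates()

    # Remove rows with missing critical fields before storage
    clean = clean.dropna(subset=["device_id", "timestamp"])

    # Normalize types so downstream stores receive consistent data;
    # unparseable timestamps become NaT and are dropped
    clean["timestamp"] = pd.to_datetime(clean["timestamp"], errors="coerce")
    clean = clean.dropna(subset=["timestamp"])

    # Filter out physically impossible values with a simple range check
    clean = clean[clean["temperature"].between(-50, 60)]

    # Persist the cleansed set in a columnar format for the data store
    clean.to_parquet("sensor_readings_clean.parquet")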
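And as a minimal sketch of one common integrity-checking mechanism mentioned under Data Integrity, a cryptographic digest computed before upload can be compared with one computed after download; hashlib is part of the Python standard library, and the file names are illustrative:

    import hashlib

    def sha256_of(path, chunk_size=1 << 20):
        """Hash the file in chunks so very large files are never fully loaded."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Record the digest before uploading to the cloud...
    before = sha256_of("dataset.bin")

    # ...and verify it after download; a mismatch signals corruption or tampering.
    after = sha256_of("dataset_downloaded.bin")
    assert before == after, "Integrity check failed: data was modified"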
2.8 Conclusion

The big data era of innovation and competition, driven by advancements in cloud computing, has made it possible to discover hidden knowledge in data. In this paper we have given an overview of big data applications in cloud computing, the challenges in storing, transforming, and processing such data, and some design principles that could guide further research.
2.9 References

1. Konstantinou, I., Angelou, E., Boumpouka, C., Tsoumakos, D., & Koziris, N. (2011). On the elasticity of NoSQL databases over cloud management platforms. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 2385-2388). ACM.
2. Labrinidis, A., & Jagadish, H. V. (2012). Challenges and opportunities with big data. Proceedings of the VLDB Endowment, 5(12), 2032-2033.
3. Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Engineering Bulletin, 32(1), 3-12.
4. Luhn, H. P. (1958). A business intelligence system. IBM Journal of Research and Development, 2(4), 314-319.
5. Sivarajah, U., et al. (2017). Critical analysis of Big Data challenges and analytical methods. Journal of Business Research, 70, 263-286.
6. https://www.bmc.com/blogs/saas-vs-paas-vs-iaas-whats-the-difference-and-how-to-choose/
7. Kavis, M. J. (2014). Architecting the Cloud: Design Decisions for Cloud Computing Service Models (SaaS, PaaS, and IaaS). John Wiley & Sons.
8. https://www.ripublication.com/ijaer17/ijaerv12n17_89.pdf
9. Sakr, S., & Gaber, M. M. (2014). Large Scale and Big Data: Processing and Management. Auerbach.
10. Ji, C., et al. (2012). Big data processing in cloud computing environments. In 2012 12th International Symposium on Pervasive Systems, Algorithms and Networks. IEEE.
11. Han, J., Haihong, E., Le, G., & Du, J. (2011). Survey on NoSQL database. In 2011 6th International Conference on Pervasive Computing and Applications (ICPCA) (pp. 363-366). IEEE.
12. Zhang, L., et al. (2013). Moving big data to the cloud. In INFOCOM, 2013 Proceedings IEEE (pp. 405-409).
13. Fernández, A., et al. (2014). Big Data with Cloud Computing: An insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 380-409.
14. http://acme.able.cs.cmu.edu/pubs/uploads/pdf/IoTBD_2016_10.pdf
15. Xiaofeng, M., & Chi, X. (2013). Big data management: Concepts, techniques and challenges. Journal of Computer Research and Development, 1(98), 146-169.