Knowledge is knowledge


Why is Python so popular?

Let's discuss why Python is so popular.

Python has a clear and readable syntax; if you already know how to program, you can learn it in very little time.

You need less code to do the same thing than in many other programming languages.

It's a great starter language because of the huge global community and wealth of documentation.

It is useful for many situations, including data science, AI and machine learning, web development, and IoT devices like the Raspberry Pi.

Large organizations like IBM, Wikipedia, Google, Yahoo!, CERN, NASA, Facebook, Amazon, Instagram, Spotify, and Reddit use Python heavily.

It is a powerful general-purpose programming language that can do a lot of things and can be applied to many different classes of problems.

It has a large standard library that provides support for many different tasks, like automation, web scraping, text processing, image processing, machine learning, and data analytics.

For data science, Python has scientific computing libraries like Pandas, NumPy, SciPy, and Matplotlib (see the short sketch below).

For artificial intelligence, there are TensorFlow, PyTorch, Keras, and Scikit-learn.

Most interestingly, it can also be used for Natural Language Processing (NLP) through the Natural Language Toolkit (NLTK).
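To make that concrete, here is a minimal sketch of Pandas, NumPy, and Matplotlib working together; the table of daily sales is made up purely for illustration.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# A small table of made-up daily sales figures.
df = pd.DataFrame({
    "day": list(range(1, 8)),
    "sales": [120, 135, 128, 150, 161, 155, 170],
})

# Pandas gives NumPy-backed summary statistics in one line each.
print(df["sales"].mean())   # average daily sales
print(np.log(df["sales"]))  # element-wise transform via NumPy

# And Matplotlib plots the trend in one more line.
df.plot(x="day", y="sales", title="Made-up weekly sales")
plt.show()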

The Python language has a code of conduct, enforced by the Python Software Foundation, that seeks to ensure safety and inclusion for all, in both online and in-person Python communities.

There are communities like PyLadies that support many interested people.

PyLadies is an international mentorship group with a focus on helping more women become active participants and leaders in the Python open-source community.


Programming languages for data science

The one-line answer is: it depends on the work you want to do. Each language has its own strengths and weaknesses, and there is no single answer to the question; it depends largely on your needs, the problem you are solving, and who you are solving it for. But the most common languages in the field of data science, as far as I know, and also the best ones to start with, are Python, R, and SQL.



Python is very popular because it is easy to learn, you can do a lot with very little code, and its syntax is easy to read. R's array-oriented syntax makes it easy to translate from math to code, and it is used especially by statisticians, mathematicians, data analysts, etc. SQL is a language for dealing with structured data effectively.
There are also others with their own strengths and weaknesses; the most popular of them are Scala, Java, C++, and Julia. JavaScript, PHP, Go, Ruby, and Visual Basic all have their own use cases.
The language you use also depends on the company you work for and the project you are assigned.

Why is Big Data such a famous topic these days?


These days, Big Data is a very famous topic of discussion in the technical field.

In the past we had data, but we did not have sufficient technologies to deal with it, so it was not so famous. At present we have technologies like Hadoop and high computation power, which is the major reason the world is now looking towards this field.



At present we are generating a huge amount of data from various sources like mobile phones, the internet, street cameras, etc. Many big companies realized the importance of data years ago, so they started to store it. As a result, we now have a huge amount of data already stored, plus more data coming in from different sources. This huge amount of data is known as Big Data.

This huge amount of data is analyzed, its hidden features and trends are identified, and these are used for the development of business, games, society, etc. For example, a company once took recorded videos of several basketball matches and analyzed them; they found the places on the court from which the chances of scoring are highest, improved their team's play on that basis, and scored much better in later games.

So we can see from the above example that the analysis of data can make a very great difference.



Hence, this is the main reason why Big Data is such a famous topic these days. The big companies working in this field realized the importance of data, or Big Data, and are working to extract useful insights from it, so over time everyone feels the importance of data; hence it is a very famous topic.

Also, skilled people are needed for this work, and demand for them is high, so this is another important point that made Big Data a trending topic.





What is Hadoop?

 

Traditionally, to process data you bring the data to the computer: you have a program for processing the data, you bring the data to the program, and the program does the processing. For big data, Larry Page and Sergey Brin came up with a different idea: they sliced the data into small pieces and distributed those pieces across thousands of computers (at first it was hundreds, then thousands, and now tens of thousands). Then they sent the program to every computer in the cluster. Each computer processed its slice of the data and sent its result back; the results were combined, and the whole data processing job was done in very little time.



The first step is known as the map, or mapper, process and the second one is known as the reduce process. It is a fairly simple idea, but it turns out to be very useful for processing large amounts of data: double the number of servers and you double the throughput. This removed the processing bottleneck that all the major social media companies had been facing.
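To make the map and reduce steps concrete, here is a toy, single-machine word count written in that style; this sketch is my own illustration, not Hadoop code, and real Hadoop would run the map step on many machines in parallel.

from collections import Counter
from functools import reduce

# Pretend each string is a slice of data sent to one machine.
slices = [
    "big data needs big clusters",
    "map the data then reduce the results",
]

# Map step: each "machine" counts the words in its own slice.
partial_counts = [Counter(s.split()) for s in slices]

# Reduce step: combine the partial results into one answer.
total = reduce(lambda a, b: a + b, partial_counts, Counter())
print(total.most_common(3))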

Yahoo then got on board. Yahoo hired Doug Cutting, who had been working on a clone of the Google big data architecture, and that clone is now called Hadoop. Hadoop has become very famous; there are hundreds of thousands of companies out there that have some kind of footprint in the big data world.


Qualities of a Data Scientist


The important qualities of a data scientist are curiosity, being extremely argumentative, and being judgmental.

The most important one is curiosity, because if you are not curious you will not know what to do with the data. Judgmental, because if you have no preconceived notions about things you will not know where to begin and where to go. Argumentative, because if you can argue about your results you can modify them, learn from the data, and arrive at a better result.



Another quality a data scientist needs is comfort and flexibility with some analytics platforms: some software, some computing platform. But that is secondary. The most important things are curiosity and the ability to take a position. Once you have done that, once you have analyzed, then you have some answers.



The last quality of a data scientist is the ability to tell a story. Once you have your analytics and your tabulations, you should be able to tell a great story from them, because everything is worthless if you cannot explain your findings: they will remain hidden, remain buried, and nobody will know. Your position in this field largely depends on the ability to tell stories.

The starting point for acquiring the qualities or skills of a data scientist is to decide which field you are interested in and in which field you want to be a data scientist. For example, say you have been gaining skills in the IT field but want to be a data scientist in the health field; a data scientist in the health field needs a different set of skills. So decide that first, and also decide what your competitive advantage is.



Your competitive advantage is not necessarily going to be your analytical skills. Your competitive advantage is an understanding of some field that sets you far apart from the crowd; maybe it's film, music, computers, art, etc. Once you have figured this out, you can start acquiring your analytical skills: work out which platforms you have to learn, and learn them; those tools will be specific to the industry you are interested in. Then, once you have some proficiency in the tools, the next thing is to apply your skills to real problems and tell the rest of the world what you can do.



The Report Structure


Q. What should a report contain if it is five to six pages or less?

Ans: In this case, the report is more to the point and presents a summary of the key findings.

Q. What does a long or detailed report contain?

Ans: A detailed report contains arguments and details about relevant work, research methodology, data sources, and intermediate findings, along with the main results.



Important constituents of a report.

Even if a report is small, say 4 to 5 pages, it should contain a cover page, table of contents, executive summary, detailed contents, acknowledgments, references, and appendices (if needed).

Explanation of some important constituents of the report.

1. Cover page: This page contains the title of the report, the names of the authors, their affiliations and contacts, the name of the institutional publisher (if any), and the date of publication.

2. Table of contents (ToC): This page lists the important topics of the whole report. It gives a quick overview of what the report contains, which helps a lot if the report is a big one.

3. Abstract or executive summary: Nothing is more powerful than explaining the crux of your arguments in three paragraphs or less. Of course, for large documents running a few hundred pages, the executive summary can be longer.

4. Introductory section: The introduction helps readers who are new to the topic become familiar with the subject.

5. Methodology: This section introduces the research methods and data sources you used for the analysis. If you collected new data, explain the data collection exercise in detail.

6. Results: In this section, you present your empirical findings. Starting with descriptive statistics and illustrative graphics, you move towards formally testing your hypotheses.

7. Discussion section: Here you craft your main arguments by building on the results you presented earlier. You rely on your narrative to let the numbers communicate your thesis to your readers. You refer the reader back to the research question and the knowledge gaps you identified earlier, and you highlight how your findings provide the missing piece of the puzzle.

8. Conclusions: Here you write the conclusions drawn from the report and the future directions for your findings.

9. References: Here you list the references for your report.

10. Acknowledgments: Acknowledging the support of those who enabled your work is always good.

11. Appendices: If needed.




What is Data Mining? And the steps in Data Mining

 What is Data Mining?

Basically, data mining is the whole process by which we process data and extract the required information from it, using various techniques like machine learning, statistics, etc.

The process goes through various steps, which are as follows:



1. Establishing Data Mining Goals

This step requires you to set up goals for the data mining exercise. We have to identify the key questions we want answered; the costs and benefits of the exercise are also factors we have to think about. Furthermore, we have to determine in advance the expected level of accuracy and usefulness of the results obtained from data mining. If money were no object, you could throw as many funds as necessary at getting the desired answers; however, the cost-benefit balance is always a major factor in setting the key goals of data mining. The desired level of accuracy also affects the cost, and vice versa; moreover, beyond a certain point, more accuracy does not improve the result much. The cost-benefit trade-offs for the desired level of accuracy are important considerations for the data mining goals.


2. Selecting Data

The output of the data mining exercise largely depends on the quality of the data being used. Sometimes data is readily available for processing; for instance, retailers often possess large databases of customer purchases and demographics. On the other hand, data may not be readily available, in which case we have to collect it through activities like surveys, which adds to the cost of the exercise. The type of data, its size, and the frequency of collection all have a direct bearing on the cost of the data mining exercise. Therefore, identifying the type of data that can answer the questions at a reasonable cost is critical.


3. Preprocessing Data

Preprocessing data is an important part of data mining. In this step we examine the data for errors and for irrelevant data; sometimes even relevant data has missing information. In the preprocessing stage you remove the irrelevant data and flag erroneous data as necessary (human errors can also be present), and the data should be subjected to checks to ensure its integrity. Lastly, we have to develop a formal method of dealing with missing data and determine whether the data are missing randomly or systematically.

If the data were missing randomly, a simple set of solutions would suffice. However, if the data are missing in a systematic way, we have to identify how that affects the results.
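As a small illustration of this stage, here is a Pandas sketch; the tiny table and the choice to fill gaps with the column median are my own assumptions, sensible only when values are missing at random.

import numpy as np
import pandas as pd

# Made-up records with some missing values.
df = pd.DataFrame({
    "age":    [25, 31, np.nan, 44, 29],
    "income": [40000, 52000, 61000, np.nan, 48000],
})

# First, see how much is missing and where.
print(df.isna().sum())

# One simple policy for randomly missing values:
# fill numeric gaps with each column's median.
df_filled = df.fillna(df.median(numeric_only=True))
print(df_filled)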


4. Transforming Data

After the relevant attributes of the data are retained, the next step is to determine the appropriate format in which to store the data. Here we focus on reducing the number of attributes needed to explain the phenomena. This may require transforming the data: data reduction algorithms, such as Principal Component Analysis, can reduce the number of attributes without losing a significant amount of information. For example, we can group a family's income from all sources into an aggregate family income.

Often you need to transform variables from one type to another. It may be prudent to transform a continuous variable like income into a categorical variable, where each record in the database is identified as a low-, medium-, or high-income individual. This can help capture the non-linearities in the underlying behaviors.
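Here is a short sketch of both transformations, using scikit-learn for Principal Component Analysis and Pandas for the continuous-to-categorical step; the cut points and the random data are illustrative assumptions only.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Continuous -> categorical: bin income into low/medium/high.
incomes = pd.Series([18000, 42000, 95000, 27000, 130000])
income_band = pd.cut(
    incomes,
    bins=[0, 30000, 80000, np.inf],  # arbitrary thresholds
    labels=["low", "medium", "high"],
)
print(income_band.tolist())

# Attribute reduction: project 4 stand-in columns down to 2.
X = np.random.default_rng(0).normal(size=(100, 4))
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (100, 2)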


5. Storing Data

The transformed data must be stored in a format that is conducive to data mining. It must give data scientists immediate and unrestricted read and write access, so that new variables can be written easily and the data mining algorithms do not have to do unnecessary searching across different servers. Privacy, security, and safety are also important points to focus on in the data storing step.


6. Mining Data

This step covers the data analysis methods, including parametric and non-parametric methods and machine-learning algorithms. A good starting point for data mining is data visualization. Multidimensional views of the data, using the advanced graphing capabilities of data mining software, are very helpful in developing a preliminary understanding of the trends hidden in the data set.
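For instance, a multidimensional view takes only a couple of lines; in this sketch the random data simply stands in for whatever set you are mining.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Random stand-in data with three attributes.
df = pd.DataFrame(
    np.random.default_rng(1).normal(size=(200, 3)),
    columns=["x1", "x2", "x3"],
)

# Plot every pair of attributes at once, histograms on the diagonal.
scatter_matrix(df, diagonal="hist")
plt.show()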


7. Evaluating Mining Results

After results have been extracted, we do a formal evaluation of them. This could include testing the predictive capabilities of the models on observed data to see how effective and efficient the algorithms have been in reproducing the data; this is known as an in-sample forecast. The results are then shared with the key stakeholders for feedback, and later iterations of the data mining follow to improve the process.

Data mining and evaluating the results thus become an iterative process, in which the analysts use better and improved algorithms to improve the quality of the results in light of the feedback received from the key stakeholders.
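To make the idea of an in-sample forecast concrete, here is a small scikit-learn sketch: fit a model, then score it on the same observed data it was trained on. The synthetic data and the choice of logistic regression are mine, purely for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic observed data.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Fit the model, then ask how well it reproduces the data it saw.
model = LogisticRegression().fit(X, y)
print("in-sample accuracy:", model.score(X, y))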

-- Reference: IBM Data Science Professional course at Coursera


What is Big Data? And its V's

 

What is Big Data?


What is Big Data? It is a very famous question in 2021. So what is it? On a daily basis, a vast amount of data is produced by us, coming from various sources: mobile phones, YouTube, Google searches, social media like Facebook and Instagram, sensor information, security cameras, sports data, etc. All this data is so large that it is very difficult to handle with traditional techniques. There's even a name for it: Big Data. Ernst and Young define big data as:

"Big Data refers to the dynamic, large and
disparate volumes of data being created by people, tools, and machines.
It requires new, innovative, and scalable technology to collect, host, and analytically
process the vast amount of data gathered in order to derive real-time business insights
that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder
value" 

What are the 5 V's of Big Data?

There is no fixed definition of big data, but in all cases certain elements are common: velocity, volume, variety, veracity, and value. These are known as the V's of Big Data.


What is the Velocity of Big Data?: Velocity is the speed at which data accumulates. Data is generated at great speed every second, all over the world, in different forms, in a process that never ends: every action on the internet or in front of a camera generates data. Think of how large the world is, how many people it has, and how many of them are using digital devices right now and playing a role in data generation.



What is the Volume of Big Data?: Volume is the scale at which data is generated, or the growth in the data stored. It is one of the most important V's, because volume is so closely tied to big data that big data is sometimes defined by how much volume it has. Today we can store a huge amount of data on small chips, and this capability to store such huge amounts of data is part of what leads to the term big data.
  

What is the Variety of Big Data?: Variety is the diversity of data: structured data, in the form of tables with rows and columns in relational databases, and unstructured data, which is not organized in a formal way, such as emails, social media, YouTube videos, blogs, business documents, etc. It also reflects data coming from different sources: mobile phones, machines, video, processes, security cameras, etc.



What is the Veracity of Big Data?: Veracity is the quality and origin of data, and its conformity to facts and accuracy. Its attributes include consistency, completeness, integrity, and ambiguity. Today we have a huge amount of data, so it is natural to ask whether the data is true or false, accurate or not; but because the data is so huge, verifying it is a struggle, and the methods for dealing with this problem are refined over time.



What is the Value of Big Data?: Value is the ability and the need to turn data into value, and value does not mean just profit; it can take other forms, like medical or social value that benefits people. Most of the time, data scientists deal with data precisely to extract value from it. If we got no value or benefit from the data, we would not bother with it; the most important point is that we deal with data because we want to extract value from it.




What is Data Science?


Let's first discuss: what is data? We can say that data is any kind of information, whether stored or not; any type of information is data, and the study of this data is known as data science. That is the one-line answer to the question, what is data science?

What is Data Science? Data science is basically the study of data. The term is also defined by others in many ways, for example by the size of the data involved or by the skills that are compulsory for a data scientist, but basically, data science is the study of data to extract meaningful information from it.

When we deal with data, we encounter many types of data, which may be structured (like tables) or unstructured (emails, social media, etc.). To deal with all of it efficiently we need different kinds of skills; this is where statistics, programming skills, mathematics, etc. come in.

At present, a vast amount of data comes in from different sources every second: emails, social media, log files, patent information, sports data, sensor information, security cameras, etc. This huge amount of data leads to the term big data.

 Present Scenario of Data Science

At present, data science is a very popular term because we have a huge amount of data, huge computing power, and efficient algorithms to deal with it, but most companies do not know how to deal with data effectively, so they pay large amounts to people who know data science well. The essential qualities of a data scientist are curiosity, being extremely argumentative and judgmental, and proficiency in some tools for dealing with data effectively.


Who can become a data scientist?


Many people have doubts about who can become a data scientist and what qualifications, skills, and traits are needed. It is clear that data science is not a field like medicine or engineering; it is not the kind of career you decide on in childhood, thinking "I want to become a data scientist" or "I want to do something in the field of data science."



Five or six years back, there was no name like data science and nobody knew about it. It is an emerging field now, and most of the people working in it jumped into the field over time, or realized over time that they were interested in it and wanted to work in it further. Many of them are from different backgrounds, like statistics, art, or science; some are musicians or others who initially worked in different fields and realized over time that they were interested and wanted to make their career in this field.



So from the above discussion we can say that there is no fixed field or qualification required to be a data scientist; the only thing that matters is your interest in the field. Also, you should first know what this field is and why it is so famous now.

Further, if you are interested in this field and want to work in it, certain qualities are required to be a data scientist; you can read about them here: Qualities needed of a data scientist
