Part 1: SQL, Python, R and data visualization
I recently graduated with a degree in chemical engineering and landed my first job as a data analyst at a technology company. I documented my journey from chemical engineering to data science here. Since then, many of the students at my school I have spoken to about the change have expressed the same interest and the same doubts...
"How did you get from engineering to data science?"
This is exactly the question I asked myself: how do I dive in? That same thought drove me forward and prompted me to start building the skills of a data scientist just over a year ago.
It certainly wasn't a lack of information that hampered my search. On the contrary, the deluge of resources for learning data science makes it difficult to separate the best from the average.
But first, let's understand: what is data science?
Ah, that's a difficult question to answer, one that baffles hiring managers and interviewees alike. In fact, different companies define data science differently, making the term ambiguous and somewhat elusive. Some say it's about programming, some say it's about math, while others say it's about understanding data. It turns out that they are all somewhat correct. The definition I agree with the most is this:
Data science is the interdisciplinary field that uses techniques and theories from mathematics, computer science and domain knowledge. [1]
This is what data science looks like in an image to me. I have blurred the boundaries between each knowledge segment to demonstrate my impression that knowledge from each of these areas is combined to form what is known as "data science".
In this series of blog posts, I want to highlight some of the courses I've taken on my journey, along with their pros and cons. With this, I hope to help people who are where I once was plan their own self-study journey in data science. The posts are:
- Part 1: Data Processing with SQL, Python and R (you're here!)
- Part 2: Math, Probability and Statistics
- Part 3: Basics of computer science
- Part 4: Machine Learning (read here!)
In this post I will highlight how I learned data processing, a necessary skill for any data scientist. To process data, you usually have to learn how to:
- Extract data from a database using SQL (Structured Query Language)
- Clean, manipulate and analyze data (usually with Python and/or R)
- Visualize data effectively.
SQL is the language for communicating with a database that holds data. If data is treasure buried underground, SQL is the shovel that unearths the treasure in its raw form. More specifically, it lets you extract information from one table or from a combination of several tables in a database.
There are many different "flavors" of SQL such as SQL Server, PostgreSQL, Oracle, MySQL and SQLite. They each differ slightly, but the syntax is still very similar and you don't have to worry about what kind of SQL you're learning.
To learn a language, first learn the words before combining them into sentences and then paragraphs. The same applies to SQL.
To learn the basics (the SQL words or phrases) I used DataCamp (Introduction to SQL) and Dataquest (SQL basics). (I'll talk more about DataCamp and Dataquest later.) These sites cover basic SQL skills with instructive exercises and examples. Some of the concepts covered are:
- SELECT and WHERE to filter and select
- COUNT, SUM, MAX, GROUP BY, HAVING to aggregate data
- DISTINCT, COUNT DISTINCT to produce lists of unique values and distinct counts
- OUTER (e.g. LEFT) and INNER JOIN, and when to use each
- String and time conversions
- UNION and UNION ALL.
(You may not know any of these yet, and that's okay! This is just a list of things you can expect to learn.)
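To make the list above concrete, here is a minimal sketch of what these basic clauses look like in practice. It runs the SQL through Python's built-in sqlite3 module so it can be tried without installing a database. The customers and orders tables and their contents are hypothetical, made up purely for illustration, and other SQL flavors may differ in small details.

```python
import sqlite3

# Hypothetical toy tables, invented only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana', 'Lima'), (2, 'Ben', 'Oslo'), (3, 'Cy', 'Lima');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 50.0);
""")

# SELECT + WHERE: pick a column and filter rows.
lima_names = [row[0] for row in conn.execute(
    "SELECT name FROM customers WHERE city = 'Lima' ORDER BY name")]

# COUNT DISTINCT: how many different cities appear?
n_cities = conn.execute(
    "SELECT COUNT(DISTINCT city) FROM customers").fetchone()[0]

# INNER JOIN + GROUP BY + HAVING: total spend per customer,
# keeping only customers whose total exceeds 40.
big_spenders = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    HAVING SUM(o.amount) > 40
    ORDER BY c.name
""").fetchall()

# LEFT JOIN: customers without any order are kept, with a count of 0.
order_counts = conn.execute("""
    SELECT c.name, COUNT(o.id) AS n_orders
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# UNION removes duplicate rows; UNION ALL keeps every row.
union_rows = conn.execute(
    "SELECT city FROM customers UNION SELECT city FROM customers").fetchall()
union_all_rows = conn.execute(
    "SELECT city FROM customers UNION ALL SELECT city FROM customers").fetchall()

print(lima_names, n_cities, big_spenders, order_counts)
print(len(union_rows), len(union_all_rows))
```

Comparing order_counts with big_spenders shows the difference between the two joins: the LEFT JOIN still lists the customer who never ordered, while the INNER JOIN drops them.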
However, doing these exercises alone did not adequately prepare me to work as an analyst. I could understand words and sentences, but I was far from able to write a whole paragraph. In particular, some notable intermediate and advanced concepts, such as subqueries and window functions, were missing or not covered in detail, even though they come up in numerous technical interviews and are essential in my current role as an analyst. These skills include:
- Handling NULLs with COALESCE
- Subqueries and their impact on query efficiency
- Temporary tables
- Self joins
- Window functions (PARTITION BY, LEAD, LAG)
- Custom (user-defined) functions
- Using indexes in queries to speed up operations.
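As a rough illustration of a few items on this list, the sketch below again uses Python's built-in sqlite3 module. The employees and sales tables are hypothetical and exist only for this example. Window functions need SQLite 3.25 or newer, which I am assuming your Python installation bundles. PARTITION BY, LEAD and user-defined functions are left out to keep the example short.

```python
import sqlite3

# Hypothetical toy tables, invented only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER, bonus REAL);
    INSERT INTO employees VALUES
        (1, 'Dana', NULL, 500.0),
        (2, 'Eli', 1, NULL),
        (3, 'Fay', 1, 200.0);
    CREATE TABLE sales (day TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2020-01-01', 10.0), ('2020-01-02', 15.0), ('2020-01-03', 12.0);
    -- An index on the column used for filtering and ordering.
    CREATE INDEX idx_sales_day ON sales (day);
""")

# COALESCE: replace a missing (NULL) bonus with 0.0.
bonuses = conn.execute(
    "SELECT name, COALESCE(bonus, 0.0) FROM employees ORDER BY id").fetchall()

# Self join: the employees table is joined to itself to find each manager.
managers = conn.execute("""
    SELECT e.name, m.name
    FROM employees AS e
    INNER JOIN employees AS m ON e.manager_id = m.id
    ORDER BY e.id
""").fetchall()

# Subquery: employees whose bonus is above the average bonus.
# AVG ignores NULLs, so the average here is taken over 500 and 200.
above_avg = [row[0] for row in conn.execute("""
    SELECT name FROM employees
    WHERE bonus > (SELECT AVG(bonus) FROM employees)
""")]

# Window function + temporary table: day-over-day change computed with LAG
# (requires SQLite 3.25+) and stored in a temp table for later queries.
conn.execute("""
    CREATE TEMP TABLE daily_change AS
    SELECT day, amount - LAG(amount) OVER (ORDER BY day) AS change
    FROM sales
""")
changes = conn.execute(
    "SELECT day, change FROM daily_change ORDER BY day").fetchall()

print(bonuses, managers, above_avg, changes)
```

The first row of changes has no previous day to compare against, so LAG returns NULL there, which Python shows as None.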
To learn these skills, I relied primarily on SQLZoo.net, which is free and offers very challenging exercises for each concept. My favorite feature of SQLZoo is that it includes exercises that test several concepts within a single question. For example, you are given an entity-relationship diagram and asked to build complex queries based on it.
This is similar to what we face at work as analysts: we combine the different techniques we have learned to extract information from the same database. The following is the entity-relationship diagram from SQLZoo's 'Help Desk' question set. Based on it, you are asked to show the manager and the number of calls received for each hour of the day on 08/12/2017. (Try it yourself here!)
Other resources I've used include SQL Questions by Zachary Thomas and LeetCode.
To start learning the programming and tools you need for data science, you can't run away from R and/or Python. They are very popular programming languages used for data manipulation, visualization and wrangling. The question of R vs Python is an age-old one that deserves its own post. My take?
It doesn't matter whether you choose R or Python: once you master one, you can easily pick up the other.
My journey with programming in Python and R started with coding websites like Codecademy, DataCamp, Dataquest, SoloLearn and Udemy. These websites offer courses organized by language or package. Each breaks the concepts down into digestible chunks and gives the user starter code with blanks to fill in. These sites usually walk you through a simple demonstration and then let you practice the concept through exercises right after. Some also offer project-based exercises.
Today I'm going to focus on two of my favorites, Datacamp and Dataquest.
Please note that below you will find affiliate links for the courses. This changes nothing for you, as the price is the same, but I do get a small commission if you decide to make a purchase.
DataCamp
DataCamp offers video courses taught by experts in the field, along with fill-in-the-blank exercises. The video lectures are mostly concise and efficient.
One thing I love about DataCamp is its up-to-date courses, organized into career tracks in SQL, R and Python. This makes planning your studies easier: all you have to do is follow the track that interests you. Some of the tracks include:
- Data Science in Python/R
- Data Analyst in Python/R/SQL
- Statistician in R
- Machine Learning Scientist in Python/R
- Python/R Programmer
Personally, I started my R training with Data Science in R, which provided a very detailed introduction to the tidyverse in R, a collection of incredibly useful packages for organizing, manipulating and visualizing data, notably ggplot2 (for data visualization), dplyr (for data manipulation) and stringr (for string manipulation).
However, I do have one complaint about DataCamp: I retained little of the material after finishing the courses. With the fill-in-the-blank format, it's easy to guess what goes in the blank without really understanding the concept. As a student on the platform, I tried to take as many courses as possible in the shortest possible time. I skimmed the code and filled in the blanks without understanding the big picture. If I could start my DataCamp learning all over again, I would take my time to digest and understand the code as a whole, not just the parts I was supposed to complete.
Dataquest
Dataquest is very similar to DataCamp. It focuses on using code exercises to illuminate programming concepts. Like DataCamp, it offers a wide range of courses in R, Python and SQL, although its catalog is a little less extensive than DataCamp's. Unlike DataCamp, however, Dataquest does not offer video lectures.
Some of the tracks Dataquest offers include:
- Data Analyst in R/Python
- Data Science in Python
- Data Engineering
Dataquest content is generally more difficult than DataCamp content. There are also fewer fill-in-the-blank exercises. Although it took me longer, I retained the material better on Dataquest.
Another great feature of Dataquest is the monthly call with a mentor, who reviews your resume and provides technical guidance. Although I didn't approach a mentor myself, in hindsight I would have, as it would surely have helped me progress much faster.
Data visualization is key to showcasing the insights you've gleaned from your data. After learning the technical skills of plotting with Python and R, I learned the principles of data visualization from a book, Storytelling with Data by Cole Nussbaumer Knaflic.
This book is platform independent. In other words, it doesn't focus on any specific software, but instead teaches the general principles of data visualization with insightful examples. Some of the key points you can learn from this book are:
- Understand the context
- Choose an effective visual
- Remove the clutter
- Draw attention where you want it
- Think like a designer
- Tell a story
I thought I knew something about data visualization until I read this book.
After digesting the book, I was able to create a (somewhat) visually appealing graphic about police brutality against Black people. One of the most important lessons from the book that I applied here was to draw attention where you want it. To do this, I highlighted the line for African Americans in a bright yellow reminiscent of the BLM color, and made sure the rest of the graphic faded into the background with more muted tones such as white and gray.
In this post, I've covered the steps I took to learn to code from scratch. With these courses, you will have the skills to manipulate data! However, there is still a long way to go. I will cover the rest in future posts:
- Part 2: Math, Probability and Statistics
- Part 3: Basics of computer science
- Part 4: Machine Learning
- Part 5: Create your first machine learning project
If you have any questions, feel free to contact me on LinkedIn. All the best and good luck!
If you enjoyed this blog post, feel free to read my other articles on machine learning:
- How to Become a Data Analyst: Data Visualization with Google Data Studio
- What makes a great wine...great? (Using Machine Learning and Partial Dependence Plot in Finding Good Wine)
- Interpreting black box ML models with LIME (Understanding LIME visually by modeling breast cancer data)
This article was translated into Russian thanks to Denis Jurtschenko.
[1] Dhar, V. (2013). "Data Science and Prediction". Communications of the ACM. 56 (12): 64–73. doi:10.1145/2500499. S2CID 6107147. Archived from the original on November 9, 2014. Retrieved September 2, 2015.