Saturday 19 May 2018

Jupyter Notebook - how to enable Intellisense

At the top of your notebook, add this line:

%config IPCompleter.greedy=True

Then, when you have an object, for example numpy imported as np, type:
np.
After the dot, press [TAB] and it will show you all the attributes and methods available.

Friday 18 May 2018

Documentation

https://pandas.pydata.org/pandas-docs/stable/tutorials.html

https://docs.scipy.org/doc/numpy/user/quickstart.html 

Matplotlib - Bar Chart
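
A minimal sketch of a bar chart with Matplotlib. The labels, the values, and the helper name make_bar_chart are illustrative assumptions, not taken from the original post.

```python
import matplotlib.pyplot as plt


def make_bar_chart(labels, values, title="Example bar chart"):
    """Draw one bar per label and return the figure and axes."""
    fig, ax = plt.subplots()
    ax.bar(labels, values)  # one bar per label, bar height taken from values
    ax.set_xlabel("Category")
    ax.set_ylabel("Value")
    ax.set_title(title)
    return fig, ax


# Hypothetical data, only for illustration.
languages = ["Python", "R", "SQL"]
scores = [8, 6, 7]
fig, ax = make_bar_chart(languages, scores)
# In a notebook the figure usually renders inline; in a script, call plt.show().
```

Working through the fig/ax objects rather than the global pyplot state makes it easier to put several charts in one notebook cell.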








Load CSV into a pandas DataFrame
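
A minimal sketch of loading CSV data with pandas. The CSV content and column names here are made up so the example is self-contained; with a real file you would pass its path to pd.read_csv instead of the in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical CSV content; replace io.StringIO(csv_text) with a file path
# such as "data.csv" when reading from disk.
csv_text = """name,age,city
Asha,31,Pune
Ben,27,Leeds
Chen,45,Taipei
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())          # first rows of the table
print(df.dtypes)          # column types inferred by pandas
print(df["age"].mean())   # quick check that the numeric column parsed as numbers
```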






output : 



Data Science

Essential skills required for data science:

1. Extract and clean data using Python/R
2. Analyse data using statistics
3. Present data using Python (numpy/pandas) or tools like Tableau
4. Build predictive models using machine learning algorithms

You should know:

1. Python
2. R
3. Statistics
4. Machine learning algorithms like Linear Regression, Logistic Regression, etc.
5. Tools like Tableau

You can use platforms like Kaggle (https://www.kaggle.com/) to work on data science projects.

https://www.analyticsvidhya.com/blog/2017/01/the-most-comprehensive-data-science-learning-plan-for-2017/ 


3.2: Basics of Mathematics and Statistics

Time suggested: 8 weeks (February 2017 – March 2017)
Topics to be covered:
  • Descriptive Statistics – 1 week
  • Probability – 2 weeks
  • Inferential Statistics – 2 weeks
  • Linear Algebra – 1 week
  • Structured Thinking – 2 weeks

Descriptive Statistics – 1 week


Probability – 2 weeks

Inferential Statistics – 2 weeks

  •  Course (mandatory) – Intro to Inferential Statistics from Udacity – Once you have gone through the descriptive statistics course, this course will take you through statistical modeling techniques and advanced statistics.
  •  Books (optional) – Online Stats Book – This online book can be used for a quick reference for inference tasks.

Linear Algebra – 1 week

  • Course (mandatory)
    • Linear Algebra – Khan Academy: This concise and excellent course on Khan Academy will equip you with the skills necessary for Data Science and Machine Learning.
  • Books (optional)

Structured Thinking – 2 weeks

  •  Competitions (mandatory): No amount of theory can beat practice. This is a strategic thinking problem which will test you on your thinking process. Also, keep an eye on business case studies as they help in structuring your thoughts tremendously.

3.3: Introducing the tool – R / Python

Time suggested: 8 weeks (April 2017 – May 2017)
Topics to be covered:
  • Tools (R/Python) – 4 weeks
  • Exploration and Visualization (R/Python) – 4 weeks
  • Feature Selection/ Engineering

Tools

1. R

  • Books – R for Data Science – This is your one stop solution for referencing basic materials on R.
  • Blogs/Articles
    • This article serves as a great reference for the entire model-building process, starting from the installation of RStudio/R.
    • R-bloggers – This is one of the most recommended blogs for R users. Every R practitioner should keep it bookmarked. It has some of the most effective and practical R tutorials.

2. Python

  • Books (mandatory) – Python for Data Analysis – This book covers various aspects of Data Science, from loading data to manipulating, processing, cleaning and visualizing it. A must-keep reference guide for pandas users.

Exploration and Visualization

1. R

  • Course
    • Exploratory Data Analysis – This is an awesome course by Johns Hopkins University on Coursera. You will need no other course to perform visualization and exploratory work in R.
  • Blogs/Articles
    • Comprehensive guide to Data Exploration in R – This is a one-stop article that I suggest you go through carefully, following every step. The steps mentioned in the article are the same steps you will use when solving any data problem or hackathon problem.
    • Cheat sheet – Data Exploration in R – This cheat sheet contains all the steps in data exploration with code. I suggest you print it and stick it on your wall for quick reference.

2. Python

  • Course (optional)
    • Intro to Data Analysis – This is an excellent course by Udacity on Data Exploration using Numpy and Pandas.
  • Books (optional) – Python for Data Analysis – A one stop solution for your Data Exploration and Visualization in Python.

Feature Selection/ Engineering

  • Books (optional) – Mastering Feature Engineering: This book is a masterpiece for learning feature engineering. Not only will you learn how to implement feature engineering in a systematic way, you will also learn the different methods involved in it.

3.4: Basic & Advanced machine learning tools

Time suggested: 12 weeks (June 2017 – August 2017)
Topics to be covered (June 2017 – July 2017):
  • Basic Machine Learning Algorithms.
    • Linear Regression
    • Logistic Regression
    • Decision Trees
    • KNN (K- Nearest Neighbours)
    • K-Means
    • Naïve Bayes
    • Dimensionality Reduction
  • Advanced algorithms (August 2017)
    • Random Forests
    • Dimensionality Reduction Techniques
    • Support Vector Machines
    • Gradient Boosting Machines
    • XGBOOST

Linear Regression

  • Course
    • Machine Learning by Andrew Ng – There is no better resource to learn Linear Regression than this course. It will give you a thorough understanding of linear regression and there is a reason why Andrew Ng is considered the rockstar of Machine Learning.
  • Blogs/Articles
    • This lesson from the Penn State Stat 501 course outlines the main features of Linear Regression, ranging from a simple definition to determining the goodness of fit of a regression line.
    • This is an excellent article with practical examples to explain Linear Regression with code.
  • Books
    • The Elements of Statistical Learning – This book is sometimes considered the holy grail of Machine Learning and Data Science. It explains Machine Learning concepts mathematically from a Statistics perspective.
    • Machine Learning with R – This is a book I personally use to have a brief understanding of Machine Learning algorithms along with their implementation code.
  • Practice
    • Black Friday – Like I already said – No amount of theory can beat practice. Here is a regression problem that you can try your hands on for a deeper understanding.
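
As a small illustration of the two ideas mentioned above (fitting a regression line and measuring its goodness of fit), here is a sketch of ordinary least squares for a single feature together with R². The data and the function names fit_line and r_squared are made up for this example; the listed courses and books may present the method differently.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data lying close to y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, r_squared(xs, ys, slope, intercept))
```

An R² close to 1 means the line explains most of the variation in the data; a value near 0 means it explains very little.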

Logistic Regression

  • Course (mandatory)
    • Machine Learning by Andrew Ng – Week 3 of this course will give you a deeper understanding of one of the most widely used classification algorithms.
    • Machine Learning: Classification – Weeks 1 and 2 of this practice-oriented Specialization course using Python will satisfy your thirst for knowledge about Logistic Regression.
  • Books (optional)
    • Introduction to Statistical Learning – This is an excellent book with quality content on Logistic Regression’s underlying assumptions, statistical nature and mathematical linkage.
  • Practice (mandatory)
    • Loan Prediction – This is an excellent competition to practice and test your new Logistic Regression skills to predict whether loan status for a person was approved or not.
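
To make the approved / not approved idea concrete, here is a rough sketch of logistic regression with one feature, trained by batch gradient descent on the log loss. The data (a scaled income value and a 0/1 approval label) is invented for illustration and has nothing to do with the actual Loan Prediction dataset; the learning rate and epoch count are arbitrary assumptions.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, lr=0.1, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict_proba(x, w, b):
    """Estimated probability that the label is 1 for input x."""
    return sigmoid(w * x + b)


# Made-up data: scaled income and whether the loan was approved (1) or not (0).
incomes = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
approved = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(incomes, approved)
print(predict_proba(0.5, w, b), predict_proba(4.5, w, b))
```

In practice you would typically use a library implementation and several features; the loop above only shows the mechanics.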

Decision Trees

  • Course (mandatory)
  • Books (mandatory)
    • Introduction to Statistical Learning – Section 8.1 and 8.3 explain the basics of decision trees through theory and practical examples.
    • Machine Learning with R – Chapter 5 of this book provides one of the best explanations of Machine Learning algorithms available on the market. Here, decision trees are explained in an extremely approachable, non-intimidating style.
  • Practice (mandatory)
    • Loan Prediction – This is an excellent competition to practice and test your new Decision Tree skills by predicting whether a person's loan was approved or not.

KNN (K- Nearest Neighbors)

  • Course (mandatory) 
    • Machine Learning – Clustering and Retrieval: Week 2 of this course progresses from 1-nearest neighbor to k-nearest neighbors and also describes the best ways to approximate the nearest neighbors. It explains all the concepts of KNN using Python.
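
The step from 1-nearest neighbor to k-nearest neighbors can be sketched as a brute-force majority vote. This is only an illustration with made-up points and a hypothetical knn_predict helper; it does not cover the approximate nearest-neighbor methods the course describes.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up 2D points: a cluster of "a", a cluster of "b", and one stray "b"
# sitting right next to the "a" cluster.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7), (2.2, 2.2)]
labels = ["a", "a", "a", "b", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2), k=1))  # decided by the single closest point
print(knn_predict(points, labels, (2, 2), k=3))  # decided by a vote among three points
```

With k=1 the stray point alone decides the label, while with k=3 the two nearby "a" points outvote it, which is the usual argument for using more than one neighbor.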

K-Means


Naive Bayes

  • Course
    • Intro to Machine Learning – Take this course to see Naive Bayes in action. In this course, Sebastian Thrun explains Naive Bayes in simple English.
  • Blog / Article
    • 6 Easy Steps to Learn Naive Bayes Algorithm (with code in Python): This article will take you through the Naive Bayes algorithm in detail. In this guide, you will learn how the algorithm works, where it is applied, and more. It will also give you hands-on knowledge of building a model using Naive Bayes.
    • Naive Bayes for Machine Learning: This is one of the most comprehensive articles I have come across. Go through it for a complete understanding of why the Naive Bayes algorithm is important for machine learning.

Dimensionality Reduction


Random Forests


Gradient Boosting Machines

  • Presentation (mandatory): Here is an excellent presentation on GBM. It covers the prominent features of GBM and the advantages and disadvantages of using it to solve real-world problems. It is a must-see for anybody trying to understand GBM.

XGBOOST

  • Blogs /Articles (mandatory)
    • Official Introduction XGBOOST – Read the documentation of this hackathon-winning algorithm. It is an improvement over GBM and is right now the most widely used algorithm for winning competitions.
    • Using XGBOOST in R – An excellent article on deploying XGBOOST in R using a practical problem at hand.
    • XGBOOST for applied Machine Learning – An article by Machine Learning Mastery to evaluate the performance of XGBOOST over other algorithms.

Support Vector Machines


3.5: Building your profile

Time suggested: 8 weeks (September 2017 – October 2017)
Topics to be covered:
  1. GitHub Profile Building
  2. Practice via competitions
  3. Discussion Portals

GitHub Profile Building (mandatory)

It is very important for a Data Scientist to have a GitHub profile to host the code of all the projects he/she has undertaken. Potential employers see not only what you have done, but also how you have coded and how frequently and for how long you have been practicing data science.
Also, code on GitHub opens up avenues to open source projects, which can greatly boost your learning. If you don’t know how to use Git, you can learn from Git and GitHub on Udacity. This is one of the best and easiest courses for learning to manage repositories through the terminal.

Practice via competitions (mandatory)

Time and again, I have stressed the fact that practice beats theory. Moreover, coding in hackathons brings you closer to developing data products that solve real-world problems. Below are the most popular platforms for participating in Data Science / Machine Learning competitions.
  1. Analytics Vidhya Datahack
  2. Kaggle competitions
  3. Crowd Analytix human layer

Discussion Forums (optional)

Discussions are a great way to learn in a peer-to-peer setup, from finding an answer to a question you are stuck on to answering someone else’s questions. Below are some discussion-rich platforms which you should keep tabs on to clear your doubts.
  1. Analytics Vidhya Discussion Portal
  2. Kaggle Discussion
  3. StackExchange

3.6: Apply for Jobs & Internships

Time suggested: 8 weeks (November 2017 – December 2017)
Topics to be covered: Jobs / Internships
If you are here after diligently following the above steps, then you can be sure that you are ready for a Job / Internship position at any Data Science / Analytics or Machine Learning firm. But it can be quite difficult to identify the right jobs. So, to save you the trouble, I have created a list of portals that list Data Science / Machine Learning jobs and internships.
  1. Analytics Vidhya Job Portal
  2. Datajobs
  3. Kaggle Job portal
  4. Internshala
To prepare for these interviews, you should go through this Damn Good Hiring Guide.